
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

PagerDuty and Arize: Integrations for ML Observability

Arize is an ML Observability platform designed to detect, troubleshoot, and eliminate ML problems faster. Use Arize to monitor your production models and send alerts to PagerDuty when your models deviate beyond a set threshold. Together, Arize and PagerDuty help you keep your teams in the loop, send more comprehensive metadata through alerts, and debug your models faster than ever before.

A new channel per incident - helpful or harmful?

I caught the tail-end of a Twitter thread the other day which centred around the use of Slack channels for incidents, and whether creating a new channel for each new incident is helpful or harmful. It turns out this is a much more evocative subject than I thought, and since I have opinions I thought I’d share them!

Uptime + Squadcast Integration: Routing Alerts Made Easy

Uptime is a site monitoring solution used to check various endpoints and notify users via push notifications when downtime is detected. It collects and stores downtime and response time data, which is then made available to users as reports. If you use Uptime for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Uptime to the right users in Squadcast. The steps below will help you set up the Uptime and Squadcast integration.

That Rogers Outage is Going to be More Expensive Than You Think

On July 8, 2022, the Canadian telecom company Rogers Communications suffered a major outage that impacted most of Canada for almost two days. This wasn’t completely unprecedented (they’d had an outage in 2021 that impacted their wireless servers for several hours), but the breadth and severity of this one is going to end up costing them far, far more than it seems at first glance.

See the big picture with the Service Dependency Graph

Understanding the impact and scope of an incident when degradation occurs is critical for bringing your service back online. This requires modeling the many downstream and upstream relationships between your services. Our new Service Dependency Graph provides a shortcut: a way to surface dependencies quickly, understand the relationships between services, and determine the scope or impact of an incident.

UBS invests in BigPanda to help drive digital disruption and innovation in AIOps

UBS is one of the leaders in the financial sector and one of the early adopters leveraging AI to do things better, cheaper, and faster, bringing its IT operations in line with its cloud migration and digital transformation strategy. BigPanda is thrilled to have UBS as a customer and an investor to drive real transformation.

August 2022 Update - Change duty status of colleagues, configurable duty notifications and revised password change

Our August update allows administrators and team administrators to change the service status of other users in the portal. We also made service settings more granular, for example by introducing the ability to turn off certain push messages when colleagues’ service statuses change. We have also revised how you change your personal password or remote action PIN in the portal. All details are available in this article.

RESOLVE '22: The SOC and the NOC

In our RESOLVE ’22 event The SOC and the NOC, moderator and 3 Tree Tech VP of Cybersecurity Kris Taylor welcomed two esteemed guests to the stage. As Kris noted at the top of the event, we brought our panelists together to talk about “the culture of the network operating center (NOC) and security operations center (SOC).” Along the way, they discussed different philosophical and practical takes on the high-level topics of networking and security.