The latest news and information on incident management, on-call, incident response, and related technologies.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own? That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud.

Whiskey and Wisdom: Justifying AIOps

Whiskey and Wisdom is a monthly executive-only forum where IT Operations leaders can network independently and discuss high-level AI operations and IT Ops strategies with their industry peers. Our most recent session focused on justifying AIOps: proving the value the technology brings to the table.

Incident Commanders: where are they now?

BigPanda gives the Incident Commander award to IT Ops superstars: people who go above and beyond in this high-pressure, critical line of work. In 2021, Ben Narramore, Director of Operations/Service Management at PlayStation, was a recipient for his ability to handle high-impact global incidents with exemplary professionalism and skill. Let’s find out what he’s been up to…

IT outages are a fact of life - it's how you handle them

In the IT world, outages and service disruptions are a fact of life. Stuff hits the fan… Stuff happens! And it can happen to any service provider – even the best-designed and best-managed SaaS applications and platforms. One of the reasons why stuff happens is a failure to adhere to best practices. To minimize the potential for problems, here we run through some of the key points from the cloud platform management best practice playbook.

How to Make Your Incident Response Plan with Mattermost

For teams who deploy software to users around the world, every second counts when responding to outages and other incidents. It’s important that you have tools in your arsenal that are up to the challenge. Service monitoring, alerting, collaboration, and visibility are all essential components of a well-implemented incident response plan.