Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

SRE Back-to-School Checklist

Whether it's in classrooms or on Zoom calls, the kids have headed back to school! Bright-eyed students are gearing up to study new subjects and test their brains. Hopefully on their report cards, failure isn’t inevitable. Before the first day, parents load up their kids’ backpacks with everything they’ll need. Being well equipped with good supplies is the best way to stay focused and educate “reliably”. Likewise, SREs need the right tools and practices for the job.

Divisions of Family Practice Adopts OnPage to Enhance Clinical Communication

Effective healthcare communication requires proper software and processes to ensure that the right person receives timely messages. Unfortunately, Divisions of Family Practice (DoFP), a large community-based network of physicians located in British Columbia, Canada, relied on a third-party answering service to connect long-term care facilities (LTCFs) with on-call providers.

What is expected in the SRE role? We analyzed 30 job postings to find out.

In 2016, Google released the definitive book on Site Reliability Engineering (SRE), a practice that originated at the company to solve a monumental problem: how to keep Google services running with high reliability. Over the years, SRE has been widely adopted by dev teams across the globe and is a popular role at startups and enterprises alike. Here is a look at how search interest in SRE has trended over the years.

How Do I Add a Major Incident Response to an Existing Integration? - Ask Adam

When we receive an alert, the obvious choice is to accept responsibility for the issue and start resolving it ourselves. But what happens when the incident is far more serious than we thought? With xMatters, you don't have to scramble to find who else is on call; you can configure the platform to find other responders for you.

A Migration That Paid Tech Dividends

TL;DR: Old, deprecated code/infrastructure is a challenge that every engineer will come across. Remedy what you can and remember that some extra effort can go a long way. It can uncover issues that, when addressed, will save you in the future. Part of the challenge of software development is maintaining legacy code and infrastructure. When you ignore or neglect these, issues start to pop up and your reliability suffers, causing pain for your customers. The trick here is to actively steward each project.

3 Ways to Use the xMatters and Microsoft Azure Monitor Integration

For a number of years, the debate on DevOps vs. ITIL has divided many technology teams. On the surface, the two practices seem at odds with one another: DevOps harnesses the power of human collaboration and communication to support innovation, while ITIL uses a more systematic and structured approach to deliver service quality and consistency. But if we take a deeper look, we find that not only can DevOps and ITIL co-exist, they can even complement each other.

Best practices for writing incident postmortems

After you have stopped an incident from affecting your customers, you need a more thorough investigation in order to prevent similar incidents in the future. Postmortems record the root causes of an incident and provide insights for making your systems more resilient. At the same time, postmortems can be difficult to produce, since they require deeper analysis and coordination between teammates who are busy with the next development cycle.