
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How Mercari Scales Vision, Culture, & Reliability

In a recent fireside chat, Mohan Bhatkar, Head of Engineering for the Customer Reliability Platform at Mercari, Inc., sat down with Blameless Co-Founder Ashar Rizqi. They talked about scaling while avoiding silos, exciting day-to-day challenges, instilling a culture of empowerment, and more. Here are their top insights and the lightly edited transcript of their conversation.

How small changes to your SLOs can be SMART for your business - A narrative case study

In the second part of his "Choosing SLOs that are appropriate for our customers" blog series, Adam Hammond narrates a fictional case study through Bill Palmer, one of the protagonists of The Phoenix Project, and shows how small changes to your SLOs can be SMART for your business. In our previous blog post, we discussed why you need to choose SLOs that are appropriate for your customers. We don’t always write out the SMART criteria and list our SLOs right away. The process is organic, and it may take a while.

Blameless Book Club: Implementing Service Level Objectives, Part 1

At Blameless, we value every opportunity to learn. Whether it’s taking time on Focus Fridays to attend a cool webinar, or conducting retrospectives for incidents, lost deals, events, and more, learning is core to our mission. To learn even more about our craft, we decided to start a book club at Blameless. People from every team (engineering, sales, SRE, marketing, product, people, and more) attended.

Automating Monitoring & Alerting Infrastructure with Terraform

At iLert we embrace infrastructure as code and try to automate our processes wherever possible. This ranges from nifty little Bash scripts to full-blown Terraform projects that spin up whole environments with nothing more than "terraform apply" on the CLI. With HashiCorp’s Terraform you can use infrastructure as code to provision and manage any cloud, infrastructure, or service.
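The excerpt does not show iLert's own scripts, so the following is only a minimal Python sketch of the "terraform apply on the CLI" step wrapped in a script so it can run unattended, for example from a CI job. It assumes Terraform 0.14 or newer (for the global "-chdir" option) is installed and on PATH; the function names and the example directory are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_terraform_commands(workdir: str, var_file: Optional[str] = None) -> List[List[str]]:
    """Return the CLI invocations for a non-interactive init followed by apply.

    Hypothetical helper: "-chdir" needs Terraform 0.14+, "-input=false"
    disables interactive prompts, and "-auto-approve" skips the plan
    confirmation so the run can happen without a human at the keyboard.
    """
    base = ["terraform", f"-chdir={workdir}"]
    init_cmd = base + ["init", "-input=false"]
    apply_cmd = base + ["apply", "-input=false", "-auto-approve"]
    if var_file is not None:
        # Optional variable file, e.g. per-environment settings.
        apply_cmd.append(f"-var-file={var_file}")
    return [init_cmd, apply_cmd]


def run_terraform(workdir: str, var_file: Optional[str] = None) -> None:
    """Run init and apply in order, stopping at the first non-zero exit code."""
    if shutil.which("terraform") is None:
        raise RuntimeError("terraform binary not found on PATH")
    for cmd in build_terraform_commands(workdir, var_file):
        # check=True raises CalledProcessError if Terraform fails.
        subprocess.run(cmd, check=True)
```

A call such as run_terraform("environments/staging") would initialize and apply the configuration in that (hypothetical) directory. Building the argument lists separately from executing them keeps the command construction easy to inspect and test without touching real infrastructure.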

Mattermost release v5.29 is now available: Incident Management, Mattermost Cloud & more

Mattermost release v5.29 is generally available today. In addition to offering bug fixes for increased stability, the new quality release features the general availability of a pre-installed incident management application, channel moderation settings, and Mattermost Omnibus.

Running Operations Is Hard. PagerDuty + Rundeck Are Here to Help

Rundeck has now joined forces with PagerDuty. What pulled us together? Our shared vision for improving the work lives of those who run modern digital services. As a co-founder of Rundeck, I’d like to share my perspective on why Rundeck becoming part of the PagerDuty family is a perfect fit for our collective user communities. Whether you are on a “you build it, you run it” DevOps team or part of a centralized Ops team, operations work has always been difficult.