Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Failover Conf Wrapup

Failover Conf was held on April 21, 2020, online. The folks at Gremlin came up with the idea of a virtual conference about reliability after many in-person conferences started being postponed or canceled due to COVID-19. The conference was a lot of fun to attend. I’ll be sharing some of my thoughts on the event and the talks I was able to catch. The videos for the talks haven’t been posted yet, but I’ll update this post with links to them when they are.

Azure service health alerts and escalation with Zenduty

Microsoft Azure is a cloud computing service that provides infrastructure as a service (IaaS), software as a service (SaaS), and platform as a service (PaaS), supporting multiple Microsoft-specific and third-party services and systems. It has 90+ compliance offerings, and 95% of Fortune 500 companies trust it to run their business. What is system downtime, and how does it affect me or my business?

How PagerDuty and Partner Rundeck Enable Business Continuity for Digital Operations

At times like these, when the world has been forced to adapt and go almost entirely digital, it’s imperative that our systems and platforms stay up and operational all the time. We are going to great lengths to make sure that the hardware and software in our application stacks are reliable and responsive. Hardware is set up with redundant backups, and new code is tested and reviewed to make sure it doesn’t introduce any bugs into the system.

Darwin Was Right: Change Will Separate the Strong from the Weak

“It is not the strongest or the most intelligent who will survive, but those who can best manage change,” said Charles Darwin over 150 years ago, and probably every IT Ops engineer out there today would agree with him. According to Gartner (and probably your experience as well), over 80% of service disruptions are caused by changes in infrastructure and software.

Virtualize the NOC: Futureproof Your IT Investment with AIOps

By abruptly forcing most people to work from home, and by triggering an economic crisis, the global pandemic has upended business operations. Not only must business leaders facilitate remote work among their employees, but they must also accommodate new ways of interacting with suppliers, partners and customers. Meanwhile, businesses’ digital channels and infrastructure, already critical prior to the crisis, have become even more essential, and yet harder to monitor and manage.

Reflections on Gremlin's Failover Conf

On April 21, 2020, thousands of industry professionals came together virtually to attend a revolutionary conference, Gremlin’s Failover Conf. With dozens of cancelled events, social distancing policies, and heightened stress due to the current crisis, it was more necessary than ever to take a moment to learn, share, and talk to one another about something we are all passionate about. We loved the experience at Failover Conf, and want to share some of our favorite parts with you.

Getting SRE Buy-in from C-Levels for Error Budgets and SLOs, Part 3

You now have postmortems properly implemented, automated, and well-structured. You’re generating reports and data automatically based on all your incidents. Two levels of management have agreed to your SRE buy-in efforts. That is a huge accomplishment! If you’re here, you’re gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest, and most strategically important, effort will be proving to your C-levels why they should buy into SRE.

Grafana alerts and incident escalation with Zenduty

Grafana is one of the most popular open-source visualization tools. It can be used on top of a variety of data stores, most commonly Graphite, InfluxDB, Prometheus, Elasticsearch, and AWS CloudWatch, among many others. Reliability engineers use Grafana for its ability to bring several data sources together in a unified dashboard and increase the observability of production systems.

Thought Leadership Panel: What is a "real" SRE?

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs. work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.