Alerting

Accelerating Incident Response

Incidents are never fun, but a bad incident response process makes them even less so. How do technical teams mobilize the right people and provide the right context and tooling to rapidly take action and drive incident resolution? With the clock ticking and up to millions of dollars lost per minute of downtime, there’s no time to waste in assembling the right experts.

Four Ways to Adapt ITSM to an Agile World

The transition to Agile development and continuous deployment has given rise to the DevOps movement, which aims to break down organizational walls. While there are many benefits to this approach, some best practices of traditional IT Service Management (ITSM) have been lost in the transition. Which ITSM processes and controls are still relevant, and how can you adapt them to the new Agile world?

Another Journey of Chaos Engineering

Chaos engineering is here to stay. There's a thriving community, numerous open source projects, a few books, even a startup. Companies are hiring chaos engineers and creating entire teams focused on chaos engineering. This talk is about strategies for launching a chaos engineering movement at your company, as well as the challenges and results you can expect.

How StatusHub Complements and Extends Your Incident Management Process

Although StatusHub focuses mainly on incident communication, it complements each of the five activities of Incident Management: Identification, Categorization, Prioritization, Response, and Communication with the user community throughout the life of the incident.

Postmortems and Retrospectives (class SRE implements DevOps)

Even after a service has been restored, SREs still have a bit of work to do. In this video, Liz and Seth discuss the postmortem process that SREs follow. Blameless postmortems and retrospectives are key to learning from failures and preventing recurrence. You will learn about the importance of conducting a postmortem, strategies for conducting a blameless postmortem, and techniques for trending retrospectives across your entire organization to gain better insights to prevent service disruptions in the future.

Overrides, the Most Human Feature in PagerDuty

If you’ve ever been on call, you know that the incidents don’t stop because you have the flu. Or when you’re attending your child’s high school graduation. Or, as I found out firsthand, even when you’re at your own wedding. Confucius once said, “If you have never had a major occasion happen while you are on call, then you may not have ever lived.” (Okay, I totally made that one up.)

It's Time to Start Talking about Digital Operations

IT operations teams have some of the most stressful jobs in IT. Keeping data centers online, servers running, enterprise systems functioning, and applications performing, all while responding to incidents and requests, is hard work. Although monitoring systems provide visibility and change management practices give IT some control over the network and environment, IT operations teams constantly feel like they are fighting a losing battle.