Monitoring that Monitors the Monitors of the Monitors

One way to break the cycle of alert fatigue is to improve the quality of the signals you monitor. That can mean ingesting and processing monitoring data at a higher resolution, using smarter statistical methods to aggregate and correlate data across multiple services, or routing alerts through an escalation and incident management system.

This IS NOT Fine: Putting Out (Code) Fires

So the dumpster is on fire. Again. The site’s down. Your boss’s face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke. Firefighters have clear procedures and a strong hierarchy. The first truck at a scene immediately begins assessing the situation.

Another Journey of Chaos Engineering

Chaos engineering is here to stay. There's a thriving community, numerous open source projects, a few books, even a startup. Companies are hiring chaos engineers and creating entire teams focused on chaos engineering. This talk is about strategies for launching a chaos engineering movement at your company, as well as the challenges and results you can expect.

Accelerating Incident Response

Incidents are never fun, but a bad incident response process makes them even less so. How do technical teams mobilize the right people and provide the right context and tooling to rapidly take action and drive incident resolution? With the clock ticking and up to millions of dollars lost per minute of downtime, there’s no time to waste in assembling the right experts.

How StatusHub Complements and Extends Your Incident Management Process?

Although the main focus of StatusHub is incident communication, it complements each of the five activities of Incident Management: Identification, Categorization, Prioritization, Response, and Communication with the user community throughout the life of the incident.

Postmortems and Retrospectives (class SRE implements DevOps)

Even after a service has been restored, SREs still have a bit of work to do. In this video, Liz and Seth discuss the postmortem process that SREs follow. Blameless postmortems and retrospectives are key to learning from failures and preventing recurrence. You will learn about the importance of conducting a postmortem, strategies for conducting a blameless postmortem, and techniques for trending retrospectives across your entire organization to gain better insights to prevent service disruptions in the future.

Overrides, the Most Human Feature in PagerDuty

If you’ve ever been on call, you know that the incidents don’t stop because you have the flu. Or when you’re attending your child’s high school graduation. Or, as I found out firsthand, even when you’re at your own wedding. Confucius once said, “If you have never had a major occasion happen while you are on call, then you may not have ever lived.” (Okay, I totally made that one up.)

It's Time to Start Talking about Digital Operations

IT operations teams have some of the most stressful jobs in IT. Keeping data centers online, servers running, enterprise systems functioning, and applications performing, all while responding to incidents and requests, is hard work. Although monitoring systems provide visibility and change management practices give IT some control over the network and environment, IT operations teams constantly feel like they are fighting a losing battle.

AlertOps Announces Playbook Automation Focusing on Critical Enterprise Needs in Fast-growing Incident Response Market

CHICAGO, Oct. 9, 2018 /PRNewswire/ -- Illinois-based digital operations management and real-time collaboration platform AlertOps announces a renewed focus on enterprises in the IT Operations Management, DevOps, and SecOps spaces. CIOs and IT leaders need vendors that can merge technology and business scenarios to solve complex collaboration and communication problems.