The latest news and information on incident management, on-call, incident response, and related technologies.

Mistakes happen for many reasons #incidentmanagement

In this clip, Dennis Henry of Okta explains why it's important to remember that mistakes happen for several reasons and don't have a single cause. In last week’s episode of The Debrief, we had on Colette Alexander, Director of Engineering at HashiCorp, to discuss some of the myths around incident response.

IRL to IAC: Your Environment to PagerDuty via Terraform

Figuring out how to represent your as-built environment in PagerDuty can be confusing for new users. There are a lot of components to PagerDuty that will help your team be successful managing incidents, integrating with other systems in your environment, running workflows, and using automation. Your organization might have a lot of these components – users, teams, services, integrations, orchestrations, etc.

Live event recap: Humanizing the on-call experience

There’s no two ways about it: on-call is stressful. But with humans at the center, it’s especially important to find ways to make it as manageable and empathetic as possible. In this webinar with our friends at ELC, incident.io VP of Engineering, Norberto Lopes, and Intercom Staff Product Engineer, Andrej Blagojević, discuss their own experiences with on-call, and how the process can be better.

Incident Management: 5 Best Practices for Seamless Operations

Website incidents can happen at any time, for any reason. Your website might stop responding to customers. Performance may slow down. Main pages may start returning client or server errors. And when incidents do strike, they bring frustration and confusion to your customers, leading to lower trust and engagement.

Upskilling your Network Operations Center

Many organizations are investing heavily in AI and automation to remove the burden of manual work and improve operational efficiency. However, to drive wide-scale adoption of these tools, they also need employees who can collaborate effectively with the technology. To bridge that gap, companies can use upskilling to retain talent, mitigate risks to the business, and allow employees to grow their careers.

Why "why" is the wrong question to be asking after incidents with Dennis Henry of Okta

In last week’s episode of The Debrief, we had on Colette Alexander, Director of Engineering at HashiCorp, to discuss some of the myths around incident response. One of the myths we spoke about in that conversation was the idea that asking “why” is better than asking “how.” In reality, asking “how” allows you to focus more on the contributing factors that led to an incident, whereas “why” tends to single out a person, which can lead to a lot of blame.

Improve incident triage with AIOps to reduce downtime

Downtime is expensive, both to your budget and your brand reputation. As IT outage costs increase, it’s critical to identify and prioritize incidents quickly to minimize the impact on your organization. In a recent survey of more than 400 global IT professionals, Enterprise Management Associates found that unplanned downtime costs average $14,056 per minute. That’s an increase of nearly 10% from 2022.

Automation Triumphs: Real-World DevOps Automation Implementations

Remember the pre-automation days in DevOps? Endless server configurations, manual deployments that took hours (or days!), and a constant feeling of being buried in repetitive tasks. Yeah, those were the times... Thankfully, those days are fading fast. The magic of automation has swept through the DevOps landscape, transforming tedious workflows into streamlined processes.

Chart a course for Operational Excellence with PagerDuty's Operational Maturity Model

A top priority for many technical leaders is improving the performance and efficiency of their teams to maximize results and minimize costs. With the PagerDuty Operational Maturity Model, IT teams can reduce the total cost of ownership with better efficiency, mitigate the risk of operational failure to ultimately protect customer experience, and shift from a reactive state towards a more proactive approach—by using the PagerDuty Operations Cloud.