Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

How to talk to your executive leadership team about reliability

Product reliability requires investment from all areas of the business. Technology leaders must effectively communicate the implications of service reliability to the rest of the organization. As a leader, how do you prove that a more reliable product is critical to success? Experts from BetterCloud, Machinify and Blameless come together to discuss how to talk to your executive leadership team about reliability in this webinar.

The Inevitable - Failures in Distributed Systems

Experiencing failure at scale is, as the popular Marvel character Thanos would say, “inevitable”. Memory leaks and software, hardware, or network I/O failures are just a few examples. It’s a problem of simple mathematics: the probability of at least one failure rises as the total number of operations performed increases. With each component added to scale the application, the likelihood of failure increases. So how do you tackle this so-called “inevitable” problem that comes with scaling?
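The "simple mathematics" can be made concrete with a minimal sketch. It assumes that every operation fails independently with the same small probability, which real systems only approximate; the function name `probability_of_any_failure` and the sample numbers are illustrative and do not come from the article.

```python
def probability_of_any_failure(p_single: float, operations: int) -> float:
    """Chance that at least one of `operations` operations fails.

    Assumption (not from the article): failures are independent and each
    operation fails with the same probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if operations < 0:
        raise ValueError("operations must be non-negative")
    # P(no failure at all) = (1 - p)^n, so P(at least one) = 1 - (1 - p)^n.
    return 1.0 - (1.0 - p_single) ** operations


if __name__ == "__main__":
    # A one-in-a-million failure rate still adds up as volume grows.
    for n in (1_000, 1_000_000, 10_000_000):
        chance = probability_of_any_failure(1e-6, n)
        print(f"{n:>10} operations -> {chance:.4f}")
```

Under these assumptions, even a tiny per-operation failure rate approaches certainty once the operation count is large enough, which is the sense in which failure at scale is "inevitable".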

10 Points of consideration for investing in an Observability Platform for your organization.

Scalability: Can the observability platform handle the volume of data that your organization generates?
Compatibility: Is the observability platform compatible with your organization's existing systems and technologies?
Ease of use: Is the observability platform user-friendly and easy for your team to adopt and use?

IT Workflow Explanation

IT workflow automation automates the execution of IT tasks and processes. This can include everything from provisioning new servers and deploying software updates to monitoring and troubleshooting IT systems. Workflow automation helps organizations reduce the time and effort required to perform these tasks by eliminating the need for manual intervention. It can also improve the accuracy and consistency of these processes, as there is less room for human error.

[PODCAST] Episode 1 Season 2; How to successfully build and defend your 2023 ITOps budget

It’s that time of year when ITOps leaders quantify their plans in budgets that must compete with other equally hungry groups for limited corporate resources. How can the thankless task of proactively preventing outages and speeding time to resolution win funding against flashier projects? Real-world facts can make that difference. Among the major topics Nigel and Craig will discuss is how to help organizations successfully build and defend their 2023 ITOps budget for investments in tooling, headcount, and workflow improvements.

PagerDuty Status Pages Enable Real-Time, Proactive Customer Communication During Incidents

Integrated, Intuitive Feature Saves Time and Money, Aligning Technical and Customer-Facing Teams, Allowing Further Consolidation on to the PagerDuty Platform, and Building Customer Trust During Large-Scale Events.

Easy to manage fine-grained access control and roles

A neatly set up access control scheme that spells out exactly what each user can do on an incident management platform can save a lot of time and hassle in the future. In the past, Spike.sh had only 2 roles - Admin and Member. The only difference between these roles was that only Admins could remove members. It was fairly simple and most users liked it. However, with larger teams coming onboard, it gets a little difficult for admins to control. So, we have extended the existing system by adding two more roles.
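As a rough illustration of role-based access control of this kind, here is a minimal sketch built on the original two roles. It is not Spike.sh's implementation: the `Action` names, the `PERMISSIONS` table, and `is_allowed` are hypothetical, and the two newer roles are left out because the text does not name them.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    MEMBER = "member"


class Action(Enum):
    # Hypothetical action names chosen for illustration only.
    REMOVE_MEMBER = "remove_member"
    ACKNOWLEDGE_INCIDENT = "acknowledge_incident"


# Each role maps to the set of actions it may perform. Following the text,
# the only difference is that Admins can remove members.
PERMISSIONS = {
    Role.ADMIN: frozenset({Action.REMOVE_MEMBER, Action.ACKNOWLEDGE_INCIDENT}),
    Role.MEMBER: frozenset({Action.ACKNOWLEDGE_INCIDENT}),
}


def is_allowed(role: Role, action: Action) -> bool:
    """Return True if `role` is permitted to perform `action`."""
    return action in PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(is_allowed(Role.ADMIN, Action.REMOVE_MEMBER))
    print(is_allowed(Role.MEMBER, Action.REMOVE_MEMBER))
```

Keeping permissions in a single table means that finer-grained roles can be added as new entries without touching the check itself.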

5 best incident management tools of 2023

Put simply, managing incidents, big or small, is good for business. Not only is it a regulatory requirement, it is also a factor in your profits. Your customers expect smooth operations, good customer service and protection. A dedicated incident management tool can help protect all of these. While many may think of incidents as an IT or DevOps issue, it’s hard to overemphasize that they can happen in any department.

Incident Management Tools - Do I Even Need Them?

Software is hard… Maintaining software reliability is harder than it used to be. Software systems have grown dramatically in complexity as they’re applied in a wider range of applications and environments, many of which have become fundamental to the everyday functioning of our society. At the same time, the pace of software development and release is faster than ever. Innovating new features faster than competitors has become the key to success in a rapidly changing market.

Managing incidents in a growing organisation - incident.fm

In this week's episode, we're joined by Matt Huxtable, CTO at Ziglu (an e-money issuer, offering a variety of digital finance services, particularly well known for its cryptocurrency services). Matt talks about how the engineering team at Ziglu has evolved over time, building an agile culture and why "keep it boring" is his mantra. Chris, Pete and Matt cover how to context switch between solving and communicating during an incident, their most creative incident fixes and why AI isn't ready to solve incidents for us just yet.