Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Sponsored Post

How To Reduce Incident Tickets

In IT environments, incidents happen all the time, and it's impossible to prevent all of them. Regardless of the available software solutions or the level of technical training of both users and developers, no organization is immune to incidents. The increased dependence on IT infrastructure to provide core services means that any disruption in IT services can cause an organization significant financial and reputational harm. For example, IT service providers need to resolve customer support tickets within the terms of their service-level agreements (SLAs), and failing to do so makes them liable for breaching those agreements.

Sponsored Post

What are Runbooks? And why are they needed?

Imagine being an Ops engineer in a team just struck by tragedy. Alarms start ringing, and incident response is in full force. It may sound like the situation is under control. WRONG! There's panic everywhere. The on-call team is scrambling for the heavenly door to redemption. But the one thing that doesn't stop: stakeholder inquiries. This situation is bad. But it could be worse. Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don't have sufficient guidance, let alone moral support, you're toast.

RESOLVE '22: Expert predictions for AIOps 2022-2025

BigPanda’s RESOLVE ‘22 conference hosted a number of luminaries in the AIOps and IT Ops world, so naturally we needed to get their thoughts on the future of the market and where they see AIOps going in the next few years. Our guests for the session titled Expert predictions for AIOps 2022-2025 were from the press, investor community, analyst community and vendor world.

Using StatusPage at Squadcast | SRE Best practices | Squadcast

Let your customers know how your Services are doing, without them having to ask you about it. One of the core principles of SRE is transparency, and Status Pages help you communicate the status of your Services to your customers at all times, rather than you learning the status of your Services from support tickets logged by your customers.

What are Canary Deployments and Why are they Important?

Every modification to software comes with the potential for production problems. Application failures often have serious consequences, which can result in a loss of revenue and a poor customer experience. At the same time, organizations constantly try to improve their services for a better customer experience. How can you minimize the chance of error and update your application with confidence?

incident.io + Indent - on-demand system access

At incident.io, we empower teams to run incidents quickly and effectively from start to finish. One of the ways we help is by taking the manual admin out of your incidents. More often than not, folks are spending too much time thinking about the process, when the time would be better spent focusing on fixing. Our automated workflows, nudges and prompts help to embed best practices and unlock time for more impactful work.

Mattermost Playbooks How-to: OKR Management

Creating, managing, and tracking high-level goals can be incredibly burdensome and complex for organizations with numerous stakeholders and cross-functional collaboration. Team leads and executives manage multitudes of reporting tools and departments, while contributors often have little visibility into how goals are created or the progress toward achieving them.

Performing Postmortems & Postmortem Templates at Squadcast | SRE Best practices | Squadcast

Postmortems are a way to summarize how an incident was resolved once it is over. They are also a way for you to create a knowledge base of failures and fixes that can be shared across your team to help build a culture of shared learning and learning from failures.

Feeling zen, finding DORA, and the policy police

We’ve had a bumper month here at incident.io HQ. We’ve welcomed three new joiners, celebrated two one-year incident.io anniversaries (congrats Lisa and Lawrence!), and released a whole load of exciting new features. And, for those of you wondering what’s been causing the recent heatwave, we’ve redesigned our website and it is on fire 🔥 😎 Here’s a round-up of some of this month's highlights…

Updating our data stack

It’s been over 6 months since Lawrence’s excellent blog post on our data stack here at incident.io, and we thought it was about time for an update. This post runs through the tweaks we’ve made to our setup over the past 2 months and the challenges we’ve found as we’ve scaled from a company of 10 people to 30, now with a 2-person data team (soon to be 3 - we’re hiring)!