Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

What are Canary Deployments and Why are they Important?

Every modification to software comes with the potential for production problems. Application failures often have serious consequences which can result in a loss of revenue and a poor customer experience. Additionally, organizations constantly try to improve their services for a better customer experience. How can you minimize the chance of error and update your application with confidence?

incident.io + Indent - on-demand system access

At incident.io, we empower teams to run incidents quickly and effectively from start to finish. One of the ways we help is by taking the manual admin out of your incidents. More often than not, folks are spending too much time thinking about the process, when the time would be better spent focusing on fixing. Our automated workflows, nudges and prompts help to embed best practices and unlock time for more impactful work.

Mattermost Playbooks How-to: OKR Management

Creating, managing, and tracking high level goals can be incredibly burdensome and complex for organizations with numerous stakeholders and cross-functional collaboration. Team leads and executives manage multitudes of reporting tools and departments while contributors often have little visibility into the process of creating goals or the progress towards achieving those goals.

Performing Postmortems & Postmortem Templates at Squadcast | SRE Best practices | Squadcast

Postmortems are a way to summarize the resolution for an incident once it is resolved. It is also a way for you to create a knowledge-base of failures and fixes that can be shared across your team to help build a culture of shared learning and learning from failures.

Updating our data stack

It’s been over 6 months since Lawrence’s excellent blog post on our data stack here at incident.io, and we thought it was about time for an update. This post runs through the tweaks we’ve made to our setup over the past 2 months and challenges we’ve found as we’ve scaled from a company of 10 people to 30, now with a 2 person data team (soon to be 3 - we’re hiring)!

Defining a Strategy for Process Automation

As business systems grow to encompass more locations, tools, and organizations, defining processes that keep pace with these changes can’t be left to a hodgepodge of disconnected programs—or worse, manual implementation of paper documentation. You need to automate. Automation within businesses first arose in the 1960s, alongside resource planning systems.

Feeling zen, finding DORA, and the policy police

We’ve had a bumper month here at incident.io HQ. We’ve welcomed 3 new joiners, celebrated two 1 year incident.io anniversaries (congrats Lisa and Lawrence!), released a whole load of exciting new features and (for those of you wondering what’s been causing the recent heatwave) we’ve redesigned our website and it is on fire 🔥 😎 Here’s a round-up of some of this month's highlights…

PagerDuty Service Standards helps organizations better configure services at scale

Service ownership, a DevOps best practice, is a method that many companies are pivoting towards. The benefits of service ownership are varied and include boons such as bringing development teams much closer to their customers, the business, and the value being delivered. The “build it, own it model” has tangible effects on customer experience, as developers are incentivized to innovate and drive customer-facing features that delight.

RESOLVE '22: AIOps: Not just a buzz phrase anymore

Thinking back to the rapidly expanding tech world of the 2010s, it’s easy to list off a number of buzzwords and phrases that became IT Ops mainstays over time. “Internet of things,” “big data” and even ideas as simple as the cloud were all once considered little more than slick marketing talk.

Mean Time to Recovery (MTTR) explained

It's Friday afternoon, and you have mail. Apparently, a user received a 500 error when attempting to sign in. She contacted Customer Service. They didn't know what to do, so they forwarded the email to your engineering team. A close look at the email thread reveals that Customer Service received it... on Tuesday. And they sat on it until today. ‍ Hopefully, it was just this one user. You open your browser, navigate to the web application, and attempt to sign in. You also get a 500 error.