Latest News

Assembly time is where you have the most control of an incident

The FDNY EMS Command responds to more than 4,000 calls per day. They range from car accidents to building fires to cats stuck in trees, and responses vary accordingly. Sometimes they might take hours, sometimes they take just a few minutes. With such unpredictable conditions, the FDNY focuses on improving what they call “response time.” That’s the amount of time between a 911 call being made and emergency responders arriving on the scene. This might sound familiar.

Trust shouldn't start at zero

How often have you heard the phrase “trust is earned” in life? While well-meaning, I think this can actually lead to some strange behaviour at work, especially when you’re on a fast-growing team. Startups experience a lot of chaos and unknowns that your teams need to navigate, so it’s vital to know you can trust the people around you. As you grow, how you set expectations around trust as people join your team can impact your ability to hire, onboard, ship, and ultimately survive.

How to Manage Customer Support Channels in Slack: A Step-by-Step Plan

As more and more teams transition to remote work, collaboration tools like Slack have become increasingly popular. Slack's chat-based communication platform makes it easy to keep teams connected and informed, but it can also create challenges when it comes to managing support channels. In this post, we'll explore different approaches to building a Slack-based support system and provide some tips for success.

10 Mistakes to avoid when framing your IT Incident Management Strategy

An IT incident is an unplanned disruption that negatively impacts an IT service. As the importance of IT to the business has increased, the impact of IT incidents has become greater. IT incidents can result in revenue loss, loss of employee productivity, SLA financial penalties, government fines, and more. An effective IT incident management strategy is now essential in every organization. For a business like Amazon, whose entire business relies on IT, a single second of slowness can cost over $15,000.

How to get started with incident management metrics

Tracking incident metrics can help you discover patterns in the causes and costs of incidents and help you understand brittle parts of your organization. We've seen them help teams zero in on specific issues. But it can be intimidating to get started. Do you really need metrics if you're a small team or just beginning to formalize your incident management program? I say yes. The key is to start with something manageable and grow.

How Abbott transformed its incident management process with Workflow Automation

Eliminating errors and streamlining the incident management process are top priorities for many ITOps, NOC, SRE, and DevOps teams. With organizations using multiple tools in their IT stack, manually finding the right information at the right time becomes crucial during incident triage. By automating tasks and workflows, businesses can eliminate manual tasks that are time-consuming, repetitive, and prone to mistakes.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is an application running in a container that gets redeployed, perhaps to a prior version or the same version, simply to solve the immediate issue.

Reflecting on one of the biggest incidents in our history

We have to come clean. During KubeCon, we experienced an incident that we weren’t ready to discuss until now. This incident caused quite a disruption and, had it been left unresolved, would have had a massive snowball effect. At the time, we didn’t want to raise any alarms, so we kept it quiet while our team rallied to resolve it. And to be honest, most folks probably didn’t even realize that it happened since we moved so quickly.

It's time to rethink the way you do external comms

April was a month to remember at incident.io. Not only did we attend our second conference ever with KubeCon in Amsterdam, but we also very subtly released our brand-new Status Pages product. OK, it probably wasn't subtle. Both moments required months of preparation, feedback loops, iteration, and so much more behind-the-scenes work to get right. So if you ran into us at KubeCon, thank you for stopping by and meeting with our team.

Mastering IT Response Time

In today’s fast-paced digital landscape, businesses heavily rely on their IT departments to ensure smooth operations and deliver exceptional customer experiences. When it comes to IT support, one critical metric stands out: response time. A prompt and efficient response can be the difference between a satisfied customer and a frustrated one. In this blog post, we will explore strategies to improve IT response times, enhance customer satisfaction, and optimize overall productivity.