Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Site Reliability Engineer's Guide to Black Friday

It’s gotten to the point where Black Friday reliability prep has to start on…well Black Friday. This year, 32% of consumers in the US claimed that they were going to start their holiday shopping in July-October. Plus, Black Friday isn’t the only day eCommerce businesses have to worry about, now we have Cyber Monday, Travel Tuesday, and the thousands of Prime Days from Amazon.

Lessons from 4 years of weekly changelogs

Writing a meaningful update for customers every week has been held sacred at incident.io since we started the company. We've written over 200 of them in the past 4 years, and we recently celebrated going 2 years straight without missing a single a single week The numbers themselves are not the goal, but the consistency of this habit and what it represents for our customers and our team is very real, and special to me.

Operationalizing AI for IT operations

Advances in artificial intelligence are rapidly transforming the IT operations landscape. According to Enterprise Strategy Group, 85% of organizations use or plan to deploy AI across many functional areas, including IT operations. Among its many benefits, AI can help ITOps teams: AI has immense potential to transform how IT operations, service management, and infrastructure teams function. Adoption is the first step toward creating organizational change.

Did Delta's slow web performance signal trouble before CrowdStrike?

The CrowdStrike outage was a reminder of how quickly the dominoes can fall—especially when the foundation is shaky. Delta Airlines was hit harder than its competitors. While United and American Airlines were able to recover within days, Delta faced ongoing struggles, leading to the cancellation of 7,000 flights over five days.

What is Uptime? Best Strategies to Improve Uptime

Uptime is a metric often used by organizations to measure website or application availability to their end users. Or as defined by Techopedia, uptime is a metric representing the percentage of time hardware, an IT system, or a device is operational. It indicates when a system is working, while downtime refers to when it is not. In today's fast-paced digital world, a website or application's availability is of utmost importance.

Against Incident Severities and in Favor of Incident Types

About a year ago, Honeycomb kicked off an internal experiment to structure how we do incident response. We looked at the usual severity-based approach (usually using a SEV scale), but decided to adopt an approach based on types, aiming to better play the role of quick definitions for multiple departments put together. This post is a short report on our experience doing it.

Observability as a superpower

With every job I have, I come across a new observability tool that I can’t live without. It’s also something that’s a superpower for us at incident.io: we often detect bugs faster than our customers can report them to us. A couple of jobs ago, that was Prometheus. In my previous job, it was the fact that we retained all of our logs for 30 days, and had them available to search using the Elastic stack (back then, the ELK stack: Elasticsearch, Logstash, and Kibana).