Latest News

How we model our data warehouse

Nov 8, 2024 By Jack Colsey In Incident.io

We've written several times about our data stack here at incident.io, but never about our underlying data warehouse and the design principles behind it. This blog post runs through the high-level structure of our data warehouse and then goes in depth into the underlying layers.

Site Reliability Engineer's Guide to Black Friday

Nov 7, 2024 By Zoe Collins In OnPage

It’s gotten to the point where Black Friday reliability prep has to start on…well, Black Friday. This year, 32% of consumers in the US claimed that they were going to start their holiday shopping between July and October. Plus, Black Friday isn’t the only day eCommerce businesses have to worry about: now we have Cyber Monday, Travel Tuesday, and the thousands of Prime Days from Amazon.

Engineering an AI Proxy for ilert

Nov 7, 2024 By Daria Yankevich In iLert

Building an AI proxy for our AI features was one of the best decisions we made a year ago. In this article, we will share why and what challenges we faced.

Lessons from 4 years of weekly changelogs

Nov 7, 2024 By Pete Hamilton In Incident.io

Writing a meaningful update for customers every week has been held sacred at incident.io since we started the company. We've written over 200 of them in the past 4 years, and we recently celebrated going 2 years straight without missing a single week. The numbers themselves are not the goal, but the consistency of this habit and what it represents for our customers and our team is very real, and special to me.

Operationalizing AI for IT operations

Nov 6, 2024 By Conor Castronovo In BigPanda

Advances in artificial intelligence are rapidly transforming the IT operations landscape. According to Enterprise Strategy Group, 85% of organizations use or plan to deploy AI across many functional areas, including IT operations. AI has immense potential to transform how IT operations, service management, and infrastructure teams function. Adoption is the first step toward creating organizational change.

Did Delta's slow web performance signal trouble before CrowdStrike?

Nov 6, 2024 By Denton Chikura In Catchpoint

The CrowdStrike outage was a reminder of how quickly the dominoes can fall, especially when the foundation is shaky. Delta Air Lines was hit harder than its competitors. While United and American Airlines were able to recover within days, Delta faced ongoing struggles, leading to the cancellation of 7,000 flights over five days.

What is Uptime? Best Strategies to Improve Uptime

Nov 6, 2024 By Rohan Taneja In Zenduty

Uptime is a metric often used by organizations to measure website or application availability to their end users. Or as defined by Techopedia, uptime is a metric representing the percentage of time hardware, an IT system, or a device is operational. It indicates when a system is working, while downtime refers to when it is not. In today's fast-paced digital world, a website or application's availability is of utmost importance.

Against Incident Severities and in Favor of Incident Types

Nov 4, 2024 By Fred Hebert In Honeycomb

About a year ago, Honeycomb kicked off an internal experiment to structure how we do incident response. We looked at the usual severity-based approach (usually using a SEV scale), but decided to adopt an approach based on types, aiming to better play the role of quick definitions for multiple departments put together. This post is a short report on our experience doing it.

Observability as a superpower

Nov 4, 2024 By Sam Starling In Incident.io

With every job I have, I come across a new observability tool that I can’t live without. It’s also something that’s a superpower for us at incident.io: we often detect bugs faster than our customers can report them to us. A couple of jobs ago, that was Prometheus. In my previous job, it was the fact that we retained all of our logs for 30 days, and had them available to search using the Elastic stack (back then, the ELK stack: Elasticsearch, Logstash, and Kibana).

The No-Nonsense Guide to Runbook Best Practices

Nov 2, 2024 By Hrishikesh Barua In IncidentHub

Runbooks are a key part of incident management and preserve institutional knowledge. They can be used both for incident response and for routine tasks like database maintenance and generating a complex report. We are mostly focused on incident response runbooks here.
