The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
Our status page product started last year as an experiment. We built it in a hurry over weekends, and to our surprise, it gained a lot of traction. People were using it and giving us feedback, which helped us improve the product over time. This year, we're thrilled to announce that we have great things planned for it! The revamped dashboard is part of that larger plan. Here's a quick gist of the multiple releases.
Earlier this year, we experienced intermittent timeouts in our application while interacting with our database over a period of two weeks. Despite our best efforts, we couldn’t immediately identify a clear cause; there were no code changes that significantly altered our database usage, no sudden changes in traffic, and nothing alarming in our logs, traces, or dashboards. During that two-week period, we deployed 24 different performance- and observability-focused changes to address the problem.
More data for data’s sake doesn’t help anyone. What organizations need is more information: actionable insight. With events and alerts streaming in, teams don’t have enough time to look at each one, and they struggle to parse and consolidate this data to figure out what to do next to resolve an incident.
We kicked off 2023 with a new team and a new product to build - Status Pages. We wanted to build a solution we could ship to customers as quickly as possible, while making sure that it’s reliable, fast and beautiful. Here’s how that process played out over the course of three months.
Working across different time zones can present significant challenges when it comes to fostering a team culture. I came across a typical scenario in a geographically distributed team with their Engineering team members based in New York and Poland. They are set to welcome a new Director of Engineering based on the West Coast. With minimal daily overlap between the teams, the question arose about how to create and manage their team culture.
Clear and frequent communication carries considerable weight in today's era of hyper-competition among businesses—especially during incidents. Because of this, status pages have become the go-to choice for companies looking to prioritize trust, transparency, and clarity with their customers, even during downtime. Unfortunately, current status page solutions have made these communications particularly frustrating and stressful.
IT incidents are events that lead to a disruption of, or deviation from, the regular operating standards of a computer system or network. They can be caused by various factors, including hardware or software failures, human error, or even deliberate external (cybersecurity) attacks. They begin with short delays or services cutting out, for example when a website or server is down, or when access to data or databases takes too long.