The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The Dangers of Alert Fatigue: Strategies for Effective Alert Management

Alert fatigue is a serious issue that affects numerous professions, especially in the IT industry. It can lead to neglecting critical events and delaying response times. IT teams need to continuously monitor their systems and applications to avert possible downtime and keep operations running smoothly. However, a high volume of incoming alerts can inundate these teams and make them less responsive. The ramifications of such disregard can severely affect the efficiency and dependability of IT teams.

Should Every Incident Get a Retro?

At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague Ben Hartshorne asked a fascinating question, which I’ll paraphrase here: That caught me by surprise. We had a great discussion, and it made me consider approaches I hadn’t before.

9 incident management solutions to improve your workflows

Incident management is a team effort. While it's true that incident management should be seen as a company-wide effort, and you should empower all teams to declare incidents, this differs from the team effort I'm referring to here. No, incident management is a team effort in the sense that no one tool can do it all, not even incident.io. We covered as much when we discussed why we integrate with tools that can be seen as our competitors – and that’s OK!

8 Best IT Monitoring Tools and Software of 2023 (Updated)

Monitoring tools, also known as observability solutions, are designed to track the status of critical IT applications, networks, infrastructures, websites and more. The best IT monitoring tools quickly detect problems in resources and alert the right respondents to resolve critical issues. Response teams use observability solutions to gain real-time insights into resource availability, stability and performance.

Install Prometheus on Kubernetes: Tutorial & Examples

As one of the most popular open-source Kubernetes monitoring solutions, Prometheus leverages a multidimensional data model of time-stamped metric data and labels. The platform uses a pull-based architecture to collect metrics from various targets. It stores the metrics in a time-series database and provides the powerful PromQL query language for efficient analysis and data visualization.
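To make the pull-based, label-oriented model described above more concrete, here is a minimal Python sketch. It is not Prometheus code and does not reflect how Prometheus is implemented. `TinyStore`, `parse_exposition`, and the `fetch` callable are hypothetical names invented for illustration, and the parser handles only a simplified subset of the text exposition format (no escaped quotes in label values, no trailing timestamps).

```python
import re
import time
from collections import defaultdict

# Matches lines such as: http_requests_total{method="GET",code="200"} 1027
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a simplified subset of the metrics text format into samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:  # unsupported syntax is ignored in this sketch
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


class TinyStore:
    """Hypothetical in-memory stand-in for a time-series database."""

    def __init__(self):
        # One series per unique (metric name, label set) combination.
        self.series = defaultdict(list)

    def scrape(self, fetch, now=None):
        """Pull model: the collector calls fetch(); the target never pushes."""
        ts = time.time() if now is None else now
        for name, labels, value in parse_exposition(fetch()):
            key = (name, frozenset(labels.items()))
            self.series[key].append((ts, value))

    def select(self, name, **matchers):
        """Return every series for `name` whose labels equal all matchers."""
        result = {}
        for (metric, labelset), points in self.series.items():
            labels = dict(labelset)
            if metric == name and all(labels.get(k) == v for k, v in matchers.items()):
                result[labelset] = points
        return result
```

In this sketch, calling `store.scrape(fetch)` on a timer plays the role of a scrape interval, and `store.select("http_requests_total", method="GET")` loosely mirrors what a label matcher such as `http_requests_total{method="GET"}` expresses in PromQL.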

An Easier, Leaner, and More Reliable Status Page

Our status page product started last year as an experiment. We built it in a hurry over weekends, and to our surprise, it gained a lot of traction. People were using it and giving us feedback, which helped us improve the product over time. And this year, we're thrilled to announce that we have great things planned for our status page product! The newly revamped dashboard is part of a larger plan for our status page product. Here's a quick gist of the multiple releases.

Battling database performance

Earlier this year, we experienced intermittent timeouts in our application while interacting with our database over a period of two weeks. Despite our best efforts, we couldn’t immediately identify a clear cause; there were no code changes that significantly altered our database usage, no sudden changes in traffic, and nothing alarming in our logs, traces, or dashboards. During that two-week period, we deployed 24 different performance and observability-focused changes to address the problem.