Operations | Monitoring | ITSM | DevOps | Cloud

5 tips for a successful on-call duty

On-call availability is crucial for many industries, especially in IT. With the growing reliance on IT systems and services, their availability directly impacts the success and satisfaction of customers. To ensure round-the-clock availability, on-call services are vital for prompt responses to emergencies and issues.

Why Clearco switched to Grafana Alerting, Grafana OnCall, and Grafana Incident

Working with technology means dealing with incidents or outages from time-to-time, so staying on top of problems is essential. Back in the spring of 2022, Clearco, the world’s largest e-commerce investor, had an alerting system set up to catch issues, except they had one problem: Clearco’s Customer Success team would learn of a problem before a notification even went off.

Prometheus Alertmanager best practices

Have you ever fallen asleep to the sounds of your on-call team in a Zoom call? If you’ve had the misfortune to sympathize with this experience, you likely understand the problem of Alert Fatigue firsthand. During an active incident, it can be exhausting to tease the upstream root cause from downstream noise while you’re context switching between your terminal and your alerts. This is where Alertmanager comes in, providing a way to mitigate each of the problems related to Alert Fatigue.

Suppression Rules in Squadcast | Minimise Alert fatigue | Suppress Non-Actionable Alerts | Squadcast

This video talks about Alert suppression in Squadcast. Alert Suppression helps you avoid alert fatigue by suppressing notifications for non-actionable alerts. Squadcast will suppress the incidents that match any of the Suppression Rules you create for your Services. These incidents will go into the Suppressed state and you will not get any notifications for them.

Maximizing IT Company Success through Effective On-Call Support

Having your systems monitored by a reliable solution is important, but how do you ensure that the right people are informed about issues that arise? Identifying problems is the first step, but they also need to be routed to the appropriate individuals. Keep in mind that employees may not always be sitting in front of the dashboard. This means being available outside of normal working hours to quickly respond to emergencies and problems, including not only weeknights but also weekends and holidays.

Common Incident Terminology

Operations, customer support, engineers and most groups use inconsistent language. This is a serious problem. Imagine NASA doing that with astronauts or a navy with ships talking to each other, but not using the same terms. Something very bad will happen. In our space of incident management, we use words like broke, failed, outage, doesn’t work, dead…all describing the same condition.

Top 5 Tools for SRE 2023 (Updated)

Site reliability engineers (SREs) are involved in scaling systems and making them reliable and efficient for organizations. But SREs often fail to build system resiliency when they do not have the right tools at their disposal. In this post, we’ll uncover the top 5 tools for SRE that can be used to drive the reliability and stability of software systems. It also examines how SREs can use the tools to improve operations tasks and infrastructure processes.

Enterprise Alert 9.4.1 comes with fixes and the revised version of the sentinel connector app

In this release, we have addressed a number of bugs that were impacting the performance and functionality of the system. In the Kernel, we have resolved an issue where the broadcast was not being stopped after the first user acknowledged it. Additionally, we have fixed a crash that was occurring when loading component infos and an error log that was being generated when the Kernel started in suspended mode.