Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Meeting customer support SLAs on Freshdesk using proactive alerting and escalations with Zenduty

As businesses close more deals and add more accounts, it remains imperative that they maintain their SLA levels and resolve customer support tickets within SLA timeframes. Having a solid support team is great, but supporting hundreds or thousands of users in the most efficient, cost-effective way while maintaining SLAs continues to be a challenge for the majority of companies. An SLA (service level agreement) policy lets you set standards of performance for your support team.

Optimizing your alerts to reduce Alert Noise

Reducing alert fatigue starts with your monitoring platform: setting the right thresholds to trigger alerts and deciding which alerts are essential enough to send to your on-call platform. This post outlines some best practices that help you reduce alert noise and improve your on-call experience. The word noise implies something unpleasant and unwanted. Combine that with on-call and it adds a factor of annoyance to an already overwhelming process.

Challenges Faced by MSPs in Light of COVID-19

The COVID-19 crisis has proven to be a challenging time for IT support teams and managed service providers (MSPs). It hasn’t only left these organizations in a vulnerable position, but also in a state of uncertainty as to what may be in store for them. OnPage interacts with current and prospective clients ranging from large businesses to small and medium enterprises (SMEs).

Getting SRE Buy-in from a Manager or Lead for Incident Response, Part 1

Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and everything in between. In this blog series, we will walk you through how to come up with a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

Virtualize the NOC: Accelerate Your Transition to Remote IT Ops with AIOps

The sudden shift to remote work caused by the global pandemic has forced IT Ops pros to quickly adjust in multiple ways to maintain the uptime and stability of critical digital services. Amidst this crisis, AIOps has emerged as a lifeline, as it facilitates remote collaboration, streamlines incident management, and accelerates detection and resolution.

Resilience in Action, Episode 1: Narratives in Incidents with Lorin Hochstein

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others. In our very first episode, Amy chats with Netflix software engineer Lorin Hochstein.

Collaborate through chaos with Opsgenie's new Slack app for incidents

During an IT incident, every second counts – but the first few minutes are the most critical. Teams who can rapidly spin up the right tools and processes have the best shot at fast resolution. And of course, many teams rely on chat tools to collaborate and communicate during incidents. So we’re excited to announce our new Slack app for Opsgenie Incidents.