
Latest News

Sponsored Post

Taking down (and restoring) the Raygun ingestion API

In a world where Software as a Service (SaaS) products are integral to daily life, maintaining uninterrupted service for end-users is paramount. However, stuff happens. When it does, our most valuable response (other than restoring service ASAP) is to review the series of events that led up to the incident and learn from them. On August 25th, 2023, at 7:02 AM NZT, Raygun experienced a significant incident that impacted our API ingestion cluster, leading to an outage lasting approximately 1 hour and 15 minutes. While this wasn't fun for anyone involved, this incident did prove to be a valuable learning experience, shedding light on the importance of infrastructure management and resilience.

Status Pages That Deliver: Top 10 Favorites

Status Pages represent an invaluable asset for websites and SaaS businesses, particularly in today's environment with prevalent outages and heightened user expectations for seamless uptime. Integral to any robust website monitoring strategy, these pages serve as centralized hubs, offering users a singular, authoritative source for tracking the status of websites and applications.

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

This blog post is adapted from my talk at SRECon EMEA 2023. Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating status pages can easily become a tedious and often neglected task during incidents. In this post, we’ll cover some tips to get your status page right.

PagerDuty and Jeli Together Will Transform Incident Management

Today is an important day for us at PagerDuty, and for the larger ecosystem of incident management. We’ve signed a definitive agreement to acquire Jeli, a standout player in the incident management space. This deal represents a strategic alignment of visions, technologies and goals that will have a lasting impact on the industry and our customers.

Basics of Incident Management

Life is full of unexpected incidents, from the coffee spill that disrupts your morning routine to the sudden traffic jam that transforms a 20-minute commute into an hour-long ordeal. Much like these everyday challenges, most of our systems and infrastructure constantly face tiny glitches. If ignored, they can have a significant impact. Unlike minor inconveniences, these glitches, which we call incidents, have the potential to disrupt your business, frustrate customers, and eat into your revenue.

Set Responders Up for Success with New User Onboarding

Effective incident response plays a critical role in maintaining smooth operations at organizations of all sizes. When built up correctly, operational resilience (the ability to bounce back quickly after failure) can act as a shield that guards your customer experience, ensuring that even when incidents inevitably happen, you’re back online in no time.

PagerDuty Operations Cloud Fall Launch 2023

Across the business landscape, 2023 has been called the “year of efficiency.” Organizations have had to deliver more growth and innovation, but with tighter budgets and headcount than in prior years. CIOs have needed to build strategies to mitigate the risk of operational failure and protect their brand’s customer experience.

Interlink's Service Chain Mapping solution: Helping Banking & Finance Organizations Strengthen Operational Resilience and Meet Regulatory Requirements

Operational resilience is an increasing area of focus and scrutiny for regulators of the banking and financial services industry. In the European Union, the Digital Operational Resilience Act (DORA) looms on the near horizon, with equivalent regulatory frameworks slowly but surely rolling out across the globe.

StatusCast vs Status.io: Status Page Comparison

In the modern-day IT landscape, service reliability is of the utmost importance. Status pages serve as crucial interfaces, communicating any interruptions or issues to stakeholders. While several options are available, two notable status page providers stand out: StatusCast and Status.io. Here we dive into the various aspects of status pages and incident management for each service.

Tips To Never Miss An Incident Notification With Squadcast Escalation Policies

Companies implement an incident response process to promptly resolve critical issues. Setting up escalation policies to notify engineers is a key step in this process. With traditional escalation policies, alert notifications still get missed, which results in longer response times and failure to meet SLAs. So, how can one ensure incident notifications are never missed?