The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Getting AWS CloudTrail alerts via SNS Endpoint

Logging and auditing have long been an essential part of troubleshooting application and infrastructure performance. With them, you can quickly spot areas of risk and correct or prevent issues. In this blog, we will explore the AWS CloudTrail service and discuss how integrating it with Squadcast can help you route alerts to the right users for quick and efficient incident response. Let's get started.

xMatters Notification Override Feature

Now you can sleep easy knowing xMatters notification override will let you know when a critical alert happens, regardless of your device status. Discover more about how xMatters can help ensure applications are always working, automate workflows, and deliver remarkable products at scale with the xMatters service reliability platform.
Sponsored Post

Simplifying SLO and Error Budget tracking for SRE teams

Service level objectives (SLOs) and the accompanying service level indicators (SLIs) are the foundation of a strong SRE culture: they promote accountability, trust and timely innovation. We are on a mission to simplify SLO and error budget tracking, and with that aim in mind we have added the SLO Tracker feature to the Squadcast platform. SLO Tracker seeks to provide a simple and effective way to keep track of your error budget burn rate without the hassle of configuring and aggregating multiple data sources.
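
For readers new to the term, the arithmetic behind an error budget and its burn rate can be sketched in a few lines of Python. This is a minimal illustration that assumes a request-based SLI (good events over total events). The function names and the sample numbers are hypothetical and are not taken from the Squadcast SLO Tracker.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_consumed(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        return 0.0  # no traffic yet, so nothing has been spent
    return failed_events / budget


def burn_rate(slo_target: float, total_events: int, failed_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would last exactly one SLO window;
    values above 1.0 mean it runs out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    observed_error_rate = failed_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO, one million requests, 500 failures.
    print(error_budget(0.999, 1_000_000))
    print(budget_consumed(0.999, 1_000_000, 500))
    print(burn_rate(0.999, 1_000_000, 500))
```

With these sample numbers, about half of the budget is spent, and the burn rate is below 1.0, so the budget would outlast the window if the error rate held steady.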

5 Tips If You're the 1st SRE Hire by Instacart's First SRE

Site Reliability Engineers (SREs) have a considerable set of tasks to juggle no matter where they work or how long their company has had an SRE practice. But if you’re the very first SRE to join an organization – as many SREs are these days, given that the SRE trend is trickling down into smaller and smaller companies – you face a special group of challenges. You may find it difficult to get buy-in for SRE from other technical teams.

Introducing Incident Types

We believe incident.io should be used across an organisation, from SRE teams to Customer Success and People Ops. Until now, the way you set up your incident response flows has relied on having one set of roles and fields for every incident, meaning you have to choose between having lots of irrelevant fields to cover every use-case, or not getting the full incident.io experience on some incidents. That’s changing today with incident types, conditional fields and roles!

Webinar: combating tool sprawl with AIOps

Dexcom is more than a business. For its customers, the organization’s innovative continuous glucose monitoring platform provides a way to take control of their health and better manage their diabetes. Given the critical services Dexcom provides to its customers, its IT Operations teams have highly specific needs when it comes to the many tools and platforms they rely on to keep the organization’s services up and running.

We can't all be Shaq: why it's time for the SRE hero to pass the ball and how to get there

At a going away party from a job I was leaving a few years back, my VP of engineering told a story I didn’t even remember but that I know subconsciously shaped how I viewed my role on that team: Toward the end of my very first day at the company, there was some internal system issue, and with pretty much zero context, I pulled out my laptop, figured out what was going on, and helped fix the issue.

When incident response requires business response, who should you notify?

From a single on-call engineer hopping online to resolve a problem, to a massive cross-team effort that brings in even the most senior technical leadership (CTO, CISO, or CIO), incident response teams are lucky when they’re able to resolve issues before a customer is aware. But in the cases where there is customer impact, other stakeholders like sales and customer service need to be informed and updated as well.