Latest News

How to Introduce Automation to Incident Response with Slack and PagerDuty

Major-incident war rooms are synonymous with stress. Pressure from executives, digging for a needle in a haystack, too much noise: it all weighs on your hardworking technical teams. Incident responders clearly need a more effective way to collaborate across technical teams, one that minimizes interruptions and keeps stakeholders up to date while ensuring everyone has the right level of context to do their job.

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

Leverage Observability With OpenTelemetry to Understand Root Cause Quickly

An observability solution should help any incident responder understand what changed and why. A lot has been written on the difference between monitoring and observability, but an easy way to understand how both are integral to incident response is to consider how customers use PagerDuty—with both monitoring and observability tools—to get to the right answer.

SREview Issue #14 June 2021

Hoping you're headed towards a fun summer season and some time without masks. Let's avoid a new kind of tan-line! This newsletter shares useful industry content and an exciting Blameless product announcement. Find our fave tweets and events in the SRE and resilience engineering community. We're hiring! Check out the job openings here.

Red Canary says 43% Lack Readiness to Notify Customers of a Security Breach

The phrase "stakeholder management" assumes that stakeholders are truly informed by alerts. However, managers can only send communications out; they cannot force people to act on them. To ensure your stakeholders are engaged during an incident, it is vital to set up a defined communication process. Yet a recent Red Canary report [1] found that 43% of surveyed participants lack readiness to notify the public and/or their customers in the event of a security breach.

Everything You Need to Know About Emergency Risk Management

Emergency risk management (ERM) is the process of identifying potential threats and minimizing the impact of disasters on business operations and people. The process requires leaders within an organization to determine how they will keep stakeholders informed and safe during critical events. Leaders must also craft disaster recovery plans to quickly remedy the effects of a catastrophic event on communities, government agencies and organizations.

Monthly Moo Update | May 2021

Goodbye May, Hello June! It’s summertime in the northern hemisphere and the sun is shining bright, along with the updates we’ve got for you this month. The team at Moogsoft is working on a few big items that are sure to put a smile on your face. But let's not forget some of the smaller items that help you day in and day out.

Manage incidents on the go with the Datadog mobile app

The Datadog mobile app enables you to check your alerts and dashboards from anywhere, so you can triage issues—and stay up to date—regardless of whether you have access to a laptop. You can now be even more productive when responding to issues while away from your keyboard by declaring incidents and notifying responders directly from your mobile device.