Latest News

VictorOps and Relay for Incident Response

Oct 19, 2020 By Eric Sorenson In Puppet

VictorOps is an incident response tool whose mission is straightforward: “To make being on call suck less.” It enables teams to quickly detect and respond to problems like a service degradation or outage. VictorOps supports a wide range of external integrations that extend its capabilities by connecting different parts of your DevOps toolchain.

SREview Issue #6 October 2020

Oct 16, 2020 By Blameless Community In Blameless

BOO! Did we scare you? We couldn’t help it, we’re just so happy it’s spooky season. Here’s the October issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

Incident Ready: How to Chaos Engineer Your Incident Response Process

Oct 16, 2020 By The FireHydrant Team In FireHydrant

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO Robert Ross shares how our customers leverage best practices to break, mitigate, resolve, and fireproof incident processes.

Microsoft's 3 major incidents in 10 days, where did they go wrong?

Oct 15, 2020 By Noam Morginstin In Exigence

Just in case you haven’t heard, last week Microsoft experienced a huge outage that prevented users from accessing its Office 365 cloud-based subscription service, which serves 200 million monthly active users. This latest outage was the third in ten days, and the company received a deluge of complaints from customers about a 'something went wrong' message that popped up when they tried to access their accounts.

October 2020 Update: Mute overwrite for iPhone (Critical Alerts), undo and more

Oct 14, 2020 By René In SIGNL4

Our October update brings the long-awaited mute-overwrite on iPhone (‘critical alerts’). We also introduce an undo action for Signl acknowledgements or closures. And in the web app you can now batch-acknowledge and close multiple Signls at once. All new features are introduced below – enjoy.

Can Security Teams Benefit from SRE? You bet!

Oct 13, 2020 By Emily Arnott In Blameless

When we talk about the reliability of services, SRE encourages us to take a holistic view. Unreliability in service delivery can be due to anything, from hardware malfunctions to errors in code. One source of unreliability that is often overlooked is security. A security breach can damage customer trust far beyond the impact of the breach itself. Even smaller infractions, like failing a service audit, can make users wary.

PagerDuty Summit: Lacework on the Shared Irresponsibility Model of Cloud Security

Oct 13, 2020 By PagerDuty In PagerDuty

Cloud security has become increasingly complex of late. Cloud providers use tens of thousands of APIs, container orchestration systems are growing in number and complexity, and more platforms and services are entering the cloud-native ring. What’s more, each of these components poses a potential security risk to organizations. And it’s you, as the customer, who is responsible for the configuration and security of those components.

Site reliability engineering-what is SRE?

Oct 11, 2020 By Amrit Balraj In Zenduty

As companies race to build site reliability engineering (SRE) practices within their engineering teams, site reliability engineer has become one of the hottest and highest-paying jobs in tech. Site reliability engineering was a term coined by Google engineer Benjamin Treynor in 2003 when he was tasked with making sure that Google services were reliable, secure and functional.

How SIGNL4 provides for a digital handover procedure

Oct 9, 2020 By Matt In SIGNL4

Handover procedures in operations and maintenance are a key element of business continuity. As work in this field is usually organized in shifts, it is essential to keep track of any critical incidents, machine breakdowns, job ownership, completion, issues that are still open or unresolved and other related items. Such knowledge has a significant impact on a timely or even proactive response, for instance if issues re-surface.
