Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Incident Response Alert Routing

You have identified a data breach, now what? Your Incident Response Playbook is up to date. You have drilled for this, you know who the key players on your team are and you have their home phone numbers, mobile phone numbers, and email addresses, so you get to work. It is seven o’clock in the evening so you are sure everyone is available and ready to respond, you begin typing “that” email and making phone calls, one at a time.

7 Ways SRE Is Changing IT Ops And How To Prepare For Those Changes

SRE best practices are disrupting and catalyzing change in the ways organizations approach IT Operations. In this blog we look at 7 ways SRE is bringing this transition. ‍Site Reliability Engineering is a new practice that has been growing in popularity among many businesses. Also known as SRE, the new activity puts a premium on monitoring, tracking bugs, and creating systems and automations that solve the problem in the long term.

4 Major Capabilities of Automated Incident Management

Automated incident management ensures that critical events are detected, addressed and resolved in a fast, efficient manner. Automation allows incident management tools to integrate with each other and fosters instant communication across the systems. Automation tears down barriers across IT operations (ITOps) teams and ensures all departments are on the same page. Teams gain full visibility into incident status to verify that incidents are addressed by the relevant groups.

SRE Leader Panel: Business Agility is what matters, SRE can help you get there

Ready for another SRE Thought Leader Panel? This one is themed, Business Agility is what matters, SRE can help you get there. We’re chatting about topics like the value of crisis during incident response, the best and worst tech transformations we’ve seen, how reliability impacts the flow of value, and more. This panel is hosted by Chris Hendrix, staff software engineer at Blameless and features guests.

Pragmatic Incident Response: Lessons learned from failures by Robert Ross Failover Conf 2021

Incident response is overwhelming. So where do you start? There's a lot of advice out there, but it's mostly theories that aren't taking reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share quick stories from his experience as an SRE and what tips he’s learned along the way.

JFrog and PagerDuty Extend Ecosystem Integration

JFrog and PagerDuty have deepened their technology integration to further boost IT operators’ and developers’ visibility into the software development lifecycle and accelerate incident resolution. The latest integration, which involves the JFrog Pipelines DevOps pipeline automation solution, simplifies and streamlines how to identify faulty builds that impact production environments.

Introducing 2-way REST capabilities with Enterprise Alert 9

The REST API in Enterprise Alert 9 has now been extended with a 2-way functionality. This allows to call webhooks or REST endpoints from third party systems on alarm status changes (acknowledge, close). Thus, in Enterprise Alert 9, it becomes child’s play to establish a 2-way integration with almost any REST enabled third party system.

What is Site Reliability Engineering [Simple Intro to SRE]

Wondering what SRE is all about? We will explain what it is, how it works, why it was developed, and how it can help your organization. So what is SRE (Site Reliability Engineering)? SRE is a methodology that fuses software and operations teams, with the goal of producing reliable, resilient, and scalable systems. Site Reliability Engineering (SRE) was developed by Google engineer Ben Treynor Sloss in 2003. Google’s goal was to increase the reliability of its sites and services.