Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Fail-Safe Digital Scheduler for On-Call Management

In this video, we discuss how OnPage's fail-safe digital schedules enable organizations to distribute workload evenly among scheduled on-call team members. The OnPage scheduler starts out "FULL", and schedules are created on top of it. As a result, a notification is still delivered when a slot is left empty on the scheduler: the scheduler reverts to the default group order and the entire group is notified, ensuring continuous coverage across your organization.
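The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OnPage's implementation or API: the `OnCallGroup` class, its `default_order` and `slots` fields, and the slot labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OnCallGroup:
    """Hypothetical model of a group whose schedule sits on top of a default order."""

    # Everyone in the group, in default escalation order (the "FULL" baseline).
    default_order: List[str]
    # Optional schedule layered on top: slot label -> members on call for that slot.
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def recipients(self, slot: str) -> List[str]:
        """Return the people to notify for the given slot."""
        scheduled = self.slots.get(slot)
        if scheduled:
            # A populated slot overrides the default order.
            return list(scheduled)
        # Empty or missing slot: revert to the default order and notify everyone.
        return list(self.default_order)


group = OnCallGroup(
    default_order=["ana", "ben", "chen"],
    slots={"mon-night": ["ben"], "tue-night": []},
)
print(group.recipients("mon-night"))
print(group.recipients("tue-night"))
```

The design choice worth noting is that the empty case is handled by the lookup itself rather than by a separate validation step, so a gap in the schedule cannot result in nobody being notified.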

Tis The Season: Protect Your Availability During The Holidays

Deck the halls! It's time for the annual holiday Code Freeze, that festive time of year when businesses impose a precautionary halt to code changes and Operations should be quiet. But before you kick up your feet, make sure that demand doesn’t lead to availability embarrassments. After all, retail experts suggest that we’re in for another online-heavy holiday shopping season, so businesses need to brace for increased digital traffic, with little tolerance for failure.

Partner Integration on Twitch: Lacework

Lacework delivers complete security and compliance for the cloud. While the cloud enables enterprises to automatically scale workloads, deploy faster, and build freely, it also makes it increasingly difficult to maintain visibility, remain compliant, stay free from known vulnerabilities, and track activity in both host workloads and ephemeral infrastructure within their environments. Integrate Lacework with PagerDuty to route Lacework Events to responders on your team. Manage and resolve configuration issues, behavioral anomalies, and compliance requirements in a timely manner across your cloud infrastructure.

How to Write Meaningful Retrospectives

One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are 7 main elements to a retrospective. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions.

5 ways incidents made me a better engineer

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.

Fall 2021 Launch: Automate Incident Response to Accelerate Critical Work

Modern businesses are digital businesses, so managing your business means mastering your critical services and operations for your employees and customers. Today, you need to be able to understand every aspect of your company as it unfolds, because in this world, seconds matter to your productivity, your revenue, and most importantly, your customers.

Mobile Service Dispatching for In Plant Transport Logistics at BASF Coatings

BASF is the largest chemical producer in the world with a revenue of EUR 59bn, 247 manufacturing sites and 110,000 employees. BASF’s Coatings division employs 11,000 people and develops, produces and markets innovative solutions for automotive OEM and automotive refinish coatings and industrial coatings as well as architectural coatings and related coating processes.

IT Failures are Inevitable

As infrastructure stacks grow increasingly complex and involve an ever-growing number of services, system failures are becoming more and more common. Systems fail for a variety of reasons: software bugs, misconfiguration, interactions between services that cause unexpected behavior, network outages, and, of course, those rare occasions when natural events render data centers inoperative.