Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Getting over on-call anxiety

You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together, how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you become less nervous.

Experiencing Turbulence? Hypercare Helps Travel and Hospitality Firms Manage Sky-High Demand

Many sectors suffered during the COVID-19 pandemic, but the travel and hospitality industry was struck particularly hard as the world went into lockdown and governments urged us to stay home. According to the International Air Transport Association, global air passenger demand in 2020 was down a record 65.9% from the previous year, and the tourism industry saw an estimated loss of 100.8 million jobs worldwide.

How to Reduce Alert Fatigue: Preventing Noisy Alerts and Error Messages

Monitoring solutions are a vital component in managing an application’s environment. From the systems layer all the way up to the end user’s connection to the app, you want to find out how the platform is performing. Indicators like CPU, memory, the number of connections, and overall health help teams make informed decisions for guaranteeing uptime. Teams monitor metrics (short-term information) and logs (long-term information) mainly from a reactive perspective.

How Grafana helps organizations manage SLOs across multiple monitoring data sources

“SLO is a favorite word of SREs,” Grafana Labs Principal Software Engineer Björn “Beorn” Rabenstein said during his talk at KubeCon + CloudNativeCon NA 2019. “Of course, it’s also great for design decisions, to set the right goals, and to set alerting in the right way. It’s everything that is good.” So what happens when things go bad?

Monitoring and Alerting 101: Monitoring Best Practices

An effective monitoring system is paramount to smooth business operations. As the need for a fast, responsive software experience gains momentum, monitoring becomes an indispensable driving force. Monitoring systems enable IT teams to proactively observe the health and responsiveness of critical environments and applications. Without monitoring, organizations must depend on customers or internal departments to receive notice of system issues.

PD Summit21: Transforming Infrastructure Teams Through Observability

What is this ""observability"" thing that everyone is talking about? Observability allows you to navigate the dark unknowns with echolocation while others attempt to fly blindly without it. Are your dashboards all green, but you still have an issue brewing? Do you need instant feedback based on the Core Analysis loop? Are your engineers tired of waking up at 3 AM for the expected issues? Is there a lack of time for experimentation? Generate your own answers and create a meaningful course of action with observability.

PD Summit21: The Netflix Reliability Story: A Brief History of How We Evolved Resilience to Failure

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. We strive to provide a service that people love and can enjoy anytime, anywhere. An important foundation for bringing our customers joy is a strong focus on reliability that ensures Netflix will be available when they need it. In this talk, I’ll tell the story of how we've grown our reliability practices over time to meet the changing demands of microservices and distributed computing.

PD Summit21: Adopting and Maturing to Service Ownership with PagerDuty and Rundeck

Among the common goals of today's engineering and operations teams is to adopt a culture of service ownership: ""You build it, you own it."" As with many ancillary objectives to driving DevOps across an organization, this is easier said than done. Sometimes this is in small part due to the technology stack/architecture of a given company. But more often than not, this is because teams lack the human-to-technology mechanisms that allow for a culture of service ownership.

PD Summit21: Migrating to L1 Support to PagerDuty

Learn how Maersk transitioned from operating with an L1 support team to using PagerDuty to drive an efficient operational support model. In this talk you will learn how implementing PagerDuty within the platform SRE team was part of a major re-org with the goal of driving a new operations model for a highly available (99.999%) platform that lead to outstanding results. At Maersk, we saw increased efficiencies and reduced TTR along with other significant advantages of using PagerDuty from both on-call and management perspectives.