
Alerting

On-call doesn't have to be stressful

Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.

The Age of Service Mesh

You have built a massively successful system. The users just can't get enough and request new features. Your developers crank out new services on a regular basis. Your DevOps/SRE team configures and scales your Kubernetes cluster (or clusters). As the system becomes more complicated and sophisticated, you realize that there are common themes that repeat across all your services.

Improving Postmortem Practices with Veteran Google SRE, Steve McGhee

For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family on the moon. How can teams climb out of it? How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control? The rope out of pager hell is woven with a thorough and rigorous postmortem process.

Sensitive Medical Data Hacked by Unsophisticated Software

There’s a solid rationale behind replacing antiquated technology, as it fails to keep pace with how the healthcare environment is evolving. One such technology is the good old pager. Recently, the U.K.’s National Health Service Trust (NHS) was on the radar when the organization’s sensitive medical data was hacked by an individual in North London. The malicious party intercepted radio waves, converting them into legible text on his computer monitor.

IDC Finds Substantial ROI for Enterprises Using PagerDuty for Digital Operations Management

In order to keep digital services running around the clock, teams need to be able to solve problems faster—or, ideally, in real time. Many vendors claim to provide value and help organizations bolster their digital operations management.

Root Cause Changes: Real Examples of Modern Root Cause Analysis from our Beta Customers

Root Cause Analysis (RCA) is an all-encompassing process. It is usually very complicated and often requires many people with many different skills – all trying to tackle an incident to determine what happened, when, why, how and ultimately who (to blame). There is, however, secret sauce today that can help solve many issues before a “full-scale” RCA process is initiated – and that is Root Cause Changes (RCC).

Advanced alerting and anywhere alert management for Azure Monitor

Have you ever wanted to be notified of important Azure Monitor alerts on your smartphone and have all the important details of the problem at your fingertips? Have you ever missed the option to easily change the status of Azure Monitor alerts in the Azure smartphone app? Ever missed a push notification because there was no persistent and advanced alerting? Then this article is for you.