Alerting

Site Reliability Engineering: Why you should adopt SRE

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure, and functional. He and his team eventually wrote the book on SRE, which is available online for free to anyone interested in researching and implementing SRE best practices.

Severity Matrix Updates

We’re on a mission to make responding to incidents a bit less chaotic. One of the best features we offer (we’re definitely not biased, no way) is a simple way to define how a severity gets determined when you open an incident. We call it the severity matrix, and today it has a new look. Previously, we had a preset list of conditions and impact that allowed you to pick a severity that matched them.

From Mayhem to Modernization: The Evolution of Critical Incident Management

Let’s face it, managing a critical incident has never been a walk in the park. Even in the “good old days,” before the great cloud revolution and the onslaught of digital transformations, an incident often meant mayhem. Processes were manual, time-consuming, and difficult to execute, document, and learn from. Getting all the right people in the “same room” at the right time was nearly impossible. Lots of time was wasted chasing down the right folks.

Top 10 I&O Technologies for a successful 2020, 2021, 2022, 2023 & 2024

Each year there comes a time to look forward and think about next year and maybe even further. This can be a daunting task, especially in the fast-changing IT industry. Luckily, Gartner prepared a list of the top 10 technologies that will drive the future of Infrastructure and Operations up through 2024. This list might come in handy when you’re preparing your 2020 roadmap and beyond.

Smart SLO Alerting With Wavefront

Back in the good old days of monolithic applications, most developers and application owners relied on tribal knowledge for what performance to expect. Although applications could be incredibly complex, the understanding of their inner workings usually resided with relatively few people in the organization. Application performance was managed informally and measured casually. However, this model falls apart in a microservices world.

How to use CloudWatch to generate alerts from logs

There are more than a million people using Amazon Cloud products, so it follows that many customers are employing an AWS integration with their Opsgenie instance. One common use case involves creating Opsgenie alerts from CloudWatch Logs to help stay ahead of issues and prevent incidents. CloudWatch Logs is an AWS log storage and monitoring feature that collects logs from all systems, applications, and AWS services in a single place.

The State of Unplanned Work: Key Findings

It’s a new world order: Skynet has taken over. Just kidding. But it sometimes feels that way, doesn’t it? In the words of Marc Andreessen, software is eating the world, and technology problems are now business problems. This means developers are now the architects of the digital experience and, by extension, the customer experience—and when said developers are unable to innovate quickly, companies are more exposed to competitive threats.

Why Escalations are Important to Clinical Communications

Unexpected events make healthcare one of the most challenging industries to navigate and plan for. Sudden patient situations tend to occur, increasing the workload of healthcare providers. Similarly, process efficiency and productivity reflect the care team’s ability to communicate. When teams are on the same page, patient wait times are significantly reduced and results are improved.