Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

10 Observability Tools in 2023: Features, Market Share and Choose the Right One for You

Understanding what's happening within your systems is a necessity. Have you ever wondered how experts keep an eye on systems to make sure everything's running smoothly? That's where observability tools come in! Observability tools are like helpers that give you a peek inside your tech. In this blog, we will talk about observability tools and how they can be used in different situations so it's easier for you to choose the right one for your organization.

But It's Not Our Fault! When Third-party Incidents Affect Your Service

Very few SaaS products exist completely independently. Between cloud service providers, payment processors, content delivery networks, and more, chances are you rely on external systems to keep your product working. When these systems fail, it can leave you feeling pretty helpless. In some cases you might have fallback options, but oftentimes all you can do is wait for recovery and clean up the fallout.

Azure Monitoring Agent: Key Features & Benefits

In today's rapidly evolving digital landscape, businesses increasingly rely on cloud computing and infrastructure to support their operations. As organizations migrate their workloads to the cloud, robust monitoring and management tools are paramount to ensure optimal performance, security, and efficiency. In response to this demand, Microsoft Azure has introduced the Azure Monitoring Agent (AMA), a powerful and versatile solution designed to enhance the monitoring capabilities of Azure resources.

Splashing into Data Lakes: The Reservoir of Observability

If you’re a systems engineer, SRE, or just someone with a love for tech buzzwords, you’ve likely heard about “data lakes”. Before we dive deep into this concept, let’s debunk the illusion: there aren’t any floaties or actual lakes involved! Instead, imagine a vast reservoir where you store loads and loads of raw data in its natural format. Now, pair this with the idea of observability and telemetry pipelines, and we have ourselves an engaging topic.

How To Write Incident Postmortems

Writing a public postmortem regarding an outage is essential to maintaining transparency and accountability when things go wrong in a service or system. The purpose of writing a postmortem is to analyze and document an incident or event that has occurred, usually with a focus on identifying its root causes, understanding what went wrong, and outlining steps to prevent similar issues from happening in the future.

Rootly Raises $12 Million from Renegade Partners, Google Gradient Ventures, & XYZ Ventures

We are excited to announce that we have raised a $12M round of financing led by Renegade Partners with participation from Google Gradient Ventures (Google’s AI-focused venture fund) and XYZ Ventures. This brings our total funding to date to $15.2M ($20M CAD) alongside our other existing investors Y Combinator and 8VC.

SRE in Transition: From Startup to Enterprise

"Startups are defined by “ship or die”. As a result, SRE teams at a startup should be focused on enabling product engineers to ship features as quickly as possible. As your startup transitions from “we’ll run out of money in the next 18 months” to “we have more than 1000 engineers”, how should the SRE organization evolve and provide the best value through that transition (including booting one up if you don’t have one)? I will discuss specific ways the organization needs to evolve to meet this challenge, how the SRE org can advocate for and support this change (both in direct actions and in “influence”), and how the overhang of startup technical and cultural debt can make this shift more challenging (but also more necessary).

Tools and Trends in Site Reliability Engineering according to Gartner's 2023 Hype Cycle

Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023, report. This blog reviews the future of site reliability engineering based on Gartner’s Hype Cycle. Additionally, the OnPage team is pleased that Gartner mentioned OnPage as a sample vendor in the Automated Incident Response category.