The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Building a Multi-Tenant Insurance Platform

In 2020, CoverWallet—a multi-tenant insurance platform—was acquired by Aon, which led to a rapid expansion in both the size and global presence of its engineering organization. In his talk, CoverWallet’s Hylke Alons walks through the changes that were necessary to meet their platform's new expectations, including improving growth and scalability while ensuring reliability, automating security, and reducing maintenance. He also discusses some best practices for scaling up engineering and product teams to handle demand in a complex and highly regulated industry like insurance.

Grafana 9.2: Create, edit queries easier with the new Grafana Loki query variable editor

As part of the Grafana 9.2 release, we’re making it easier to create dynamic and interactive dashboards with a new and improved Grafana Loki query variable editor. Templating is a great option if you don’t want to deal with hard-coding certain elements in your queries, like the names of specific servers or applications. Previously, you had to remember and enter specific syntax in order to run queries on label names or values.

The Human Element of Tech Development

Opportunities for growth are all around us, but it takes openness and an eager growth mindset to see them. In this episode, David Noblet, Co-Founder + Chief Architect at ChaosSearch, shares how he and his team find innovative ways to improve digital services for their clients by constantly taking inspiration from their daily lives.

Pipeline Profiling: Or How I Learned to Stop Worrying and Isolate the Problem

It’s that time of year again! If you’re not a procrastinator, you’ve probably already blown out your sprinklers for winter and are looking forward to the snow and holidays ahead. Well done, irrigation purists! I, on the other hand, am an Olympic-level procrastinator and will usually wait until the last moment before the NWS forecasts 10″ of snow for the night, then frantically search for my air compressor.

Generate RUM-based metrics to track historical trends in customer experience

Datadog Real User Monitoring (RUM) provides end-to-end visibility into the user experience and performance of your browser and mobile applications. RUM allows you to capture and retain complete user sessions for 30 days. This means you can pinpoint bugs, prioritize issues, and determine fixes with data collected across an entire month.

5 Reasons Why OpenTelemetry is the Future of Observability

It has been said that open source is eating the world, and in the observability space the project behind this movement is OpenTelemetry. The project is quickly becoming the standard for instrumentation and collection of observability data. Why is an open-standard, open-source approach to instrumentation and data collection so compelling? This talk will provide five reasons why OpenTelemetry is disrupting the observability market.

My Most Surprising Discoveries from The SRE Report 2023

I’ve had the honor and privilege of authoring The SRE Report for the last three years. For the 2023 version, this included working with some amazing individuals like Anna Jones, Kurt Andersen, and Steve McGhee. Download The SRE Report 2023 here (no registration required).

Reducing MTTR for DevOps and SREs with PagerDuty Process Automation and InfluxDB

Mean time to resolution (MTTR) is a metric that transcends industry and technology. It’s a measure of how quickly, on average, support teams identify, act on, and resolve IT issues and incidents. Because MTTR directly relates to service quality, maintaining a low MTTR is a critical goal for DevOps and SRE teams. These teams have a vested interest in resolving issues quickly because escalating incidents to higher levels of the support team increases response and resolution times.
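
As a rough illustration of the metric itself, MTTR can be computed as the average of resolution time minus open time across a set of incidents. The sketch below uses only the Python standard library; the function name `mean_time_to_resolution` and the sample timestamps are hypothetical and are not taken from PagerDuty or InfluxDB.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - opened) over a list of (opened, resolved) pairs.

    Hypothetical helper for illustration; returns a timedelta.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual durations, starting from a zero-length timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: one incident resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 9, 30)),
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 15, 30)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)
```

Because the result is an average, a single long-running incident can move it considerably, which is one reason teams often track it alongside per-incident durations.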

How Do You Measure Application Performance?

Web performance isn’t just about how long a website needs to render all its page elements—it also covers techniques for monitoring an application’s runtime, user-defined transactions, component response times, and network requests. The important thing is using performance data to evaluate the success of your app or service, whether you’re trying to compare different versions or introduce new capabilities.

Reduce Data Costs: Log Sampling with OpenTelemetry and BindPlane OP

Redundant logs are a common nuisance in observability pipelines of all kinds. In large environments, excess logs can drive data costs to unsustainable levels. Log sampling is the process of randomly keeping only a fraction of logs, with the aim of preserving the same valuable insight at a dramatically reduced data volume. Configuring agents in a pipeline to appropriately sample logs can be a pain. Pipeline managers, like BindPlane OP, make that process simple and scalable.
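
To make the idea of random log sampling concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry or BindPlane OP implementation; the function `sample_logs` and its `keep_ratio` and `seed` parameters are assumptions made up for this example.

```python
import random


def sample_logs(lines, keep_ratio, seed=None):
    """Keep each log line independently with probability keep_ratio.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be between 0 and 1")
    rng = random.Random(seed)  # a seed makes the selection reproducible
    # rng.random() returns a float in [0.0, 1.0), so a ratio of 1.0 keeps
    # every line and a ratio of 0.0 drops every line.
    return [line for line in lines if rng.random() < keep_ratio]


# Made-up log lines for demonstration.
logs = [f"GET /health 200 request={i}" for i in range(1000)]
sampled = sample_logs(logs, keep_ratio=0.1, seed=42)
print(len(sampled), "of", len(logs), "lines kept")
```

With independent per-line sampling the number of lines kept varies from run to run around the target ratio, so the reduction in volume is approximate rather than exact.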