September 2024

Project management à la SRE: How to juggle the needs of your project and production

Sep 28, 2024 By Karan Anand In Google Operations

Most IT project management frameworks are directed at single-focus teams like software development, not multi-focus teams like SRE.

Press Start to Scale: SRE in Gaming - Incidentally Reliable with Denys Pashutynski

Sep 27, 2024 By Zenduty In Zenduty

In our latest episode, we speak with Denys Pashutynski, Senior Engineering Manager of Site Reliability at Roblox, about the formidable challenges of sustaining a global gaming platform. Drawing from his tenure at Twitter, AWS, and eBay, Denys delves into managing traffic surges, latency optimization, and strategic change management. Available exclusively on The Incidentally Reliable podcast, which is made by SREs for SREs and hosted by Zenduty.

View Video

Prometheus Recording Rules: A Developer's Guide to Query Optimization

Sep 27, 2024 By Prathamesh Sonpatki In Last9

This guide breaks down how recording rules can help, with simple tips to improve performance and manage complex data.

Financial Benefits of Incident Management: Cost Savings and ROI

Sep 26, 2024 By Spandan Pal In Squadcast

Have you ever assessed the financial impact of an hour of downtime on your business? If not, the results might be more alarming than you expect. For large enterprises, the cost can easily reach millions, and that's just the tip of the iceberg.

How AI is Revolutionizing SaaS and Cloud Software: Key Trends for 2025

Sep 26, 2024 By Vishal Padghan In Squadcast

In recent years, artificial intelligence (AI) has ceased to be a mere technological trend and has established itself as a foundational element shaping the future of Software as a Service (SaaS) and cloud-based software solutions. By 2025, AI's integration into these domains will not just enhance existing functionalities but redefine what is possible in ways we’re only beginning to comprehend.

Tail Latency: A Critical Factor in Large-scale Distributed Systems

Sep 26, 2024 By Anjali Udasi In Last9

Tail latency significantly impacts large-scale systems. This blog covers its importance, contributing factors, and effective reduction strategies.

Prometheus Rate Function: A Practical Guide to Using It

Sep 25, 2024 By Anjali Udasi In Last9

In this guide, we’ll walk you through the Prometheus rate function. You’ll discover how to analyze changes over time and use that information to enhance your monitoring strategy.

SRE vs. DevOps vs. Platform Engineering: Differences Explained

Sep 24, 2024 By Shanika Wickramasinghe In Splunk

SRE, DevOps, and Platform Engineering are important concepts in today's world of software development. Dedicated teams manage these areas, each with its own primary focus, responsibilities, tools, and performance metrics. This article explains SRE, DevOps, and Platform Engineering, including similarities and differences, and, most importantly, how these teams help streamline modern software development, delivery, and maintenance processes.

OpenTelemetry Collector: The Complete Guide

Sep 24, 2024 By Prathamesh Sonpatki In Last9

This guide explains the key aspects of the OpenTelemetry Collector, including its features, use cases, and practical tips for managing telemetry data effectively.

7 Best Practices for Effective Log Formatting

Sep 23, 2024 By Shubham Bhaskar Sharma In Zenduty

Logs play a critical role in monitoring the health and behavior of your applications and systems and in diagnosing problems. However, logs deliver value only if they are structured and well formatted. Effective log formatting helps you identify and fix an issue quickly rather than sifting through unorganized, hard-to-read logs. In this blog, we delve into 7 super-effective practices for production logging to help you maximize your log analysis capabilities.

What is Log Monitoring? Complete Guide for 2025

Sep 23, 2024 By Shubham Bhaskar Sharma In Zenduty

In today’s complex environments, built on cloud-native technologies, containers, and microservices-based architectures, reliable log monitoring is crucial for keeping your systems secure and resilient. Continuous monitoring enables organizations to stay in control, providing proactive insights into system health and performance. With platforms like AWS, GCP, and Azure churning out massive amounts of logs, it’s easy to get overwhelmed.

Trusting AI for Incident Response: The Role of AI in Modern Incident Management

Sep 20, 2024 By Vishal Padghan In Squadcast

In an age where every second counts, the swift resolution of IT incidents can mean the difference between maintaining business continuity and enduring significant operational setbacks. As businesses increasingly embrace digitalization, the complexity and volume of incidents rise exponentially. This new reality calls for innovative approaches to incident management—ones that can manage the unpredictability, scale, and urgency of modern IT ecosystems. Enter artificial intelligence (AI).

Adding Cluster Labels to Kubernetes Metrics

Sep 20, 2024 By Prathamesh Sonpatki In Last9

A definitive guide to adding a cluster label to all Kubernetes metrics.

An Engineer's Checklist of Logging Best Practices

Sep 20, 2024 By Rox Williams In Honeycomb

The best DevOps and SRE teams have shifted their approach to monitoring and logging their systems. These teams debug problems cohesively and rationally, regardless of the system’s complexity. Gone are the days of having a slew of logs that fail to explain the cause of alerts, system failures, and other unknowns.

How to Use Jaeger with OpenTelemetry

Sep 19, 2024 By Anjali Udasi In Last9

This guide shows you how to easily use Jaeger with OpenTelemetry for improved tracing and application monitoring.

Prometheus Alternatives

Sep 17, 2024 By Gabriel Diaz In Last9

What are the alternatives to Prometheus? A guide to comparing the different options.

Identify root spans in Otel Collector

Sep 16, 2024 By Prathamesh Sonpatki In Last9

How to identify root spans in the OpenTelemetry Collector using the filter and transform processors.

Optimizing Prometheus Remote Write Performance: Guide

Sep 16, 2024 By Gabriel Diaz In Last9

Master Prometheus remote write optimization. Learn queue tuning, cardinality management, and relabeling strategies to scale your monitoring infrastructure efficiently.

The Shift from SRE to Platform Engineering: Why It's the Future of Scalability and Innovation

Sep 14, 2024 By Alin Dobra In Bunnyshell

As technology evolves, so do the roles and strategies that drive software development and infrastructure management. One of the most significant shifts we’ve seen in recent years is the move from Site Reliability Engineering (SRE) to platform engineering. This change is reshaping how companies operate, from scaling their infrastructure to improving the developer experience.

The Future of SLOs in DevOps: Navigating Common Pitfalls in SLO Management

Sep 13, 2024 By Vishal Padghan In Squadcast

As the technology landscape continues to evolve, so do the methods by which organizations ensure optimal service delivery. Service Level Objectives (SLOs) have emerged as one of the most critical metrics in DevOps and Site Reliability Engineering (SRE), acting as a bridge between reliability and performance. SLOs reflect the target reliability of a service from the perspective of the user, providing measurable standards to maintain quality.

PromCon 2024 - Day 2

Sep 13, 2024 By Prathamesh Sonpatki In Last9

Catch up on Day 2 of PromCon 2024. Read about the key talks and takeaways from the second day of this exciting event.

Top 10 Platform Engineering Tools in 2024

Sep 13, 2024 By Prathamesh Sonpatki In Last9

Check out these 10 tools that are making a real difference in how teams build, manage, and scale their platforms in 2024.

Golang Logging: A Comprehensive Guide for Developers

Sep 13, 2024 By Prathamesh Sonpatki In Last9

Our blog covers practical insights into Golang logging, including how to use the log package, popular third-party libraries, and tips for structured logging.

Developer's Guide to Installing OpenTelemetry Collector

Sep 13, 2024 By Prathamesh Sonpatki In Last9

Learn how to install and configure the OpenTelemetry Collector for enhanced observability. This guide covers Docker, Kubernetes, and Linux installations with step-by-step instructions and configuration examples.

Jira and ServiceNow: A Comparative Analysis for Effective Incident Management

Sep 12, 2024 By Spandan Pal In Squadcast

Incident management isn't just a buzzword—it's critical to keeping operations running smoothly. When systems fail, the ripple effects can be costly. For enterprises, maintaining service continuity and keeping customers satisfied depends on quick, efficient incident responses. That's where tools like Jira Service Management (JSM) and ServiceNow come in.

How the Cribl SRE Team Uses Cribl Edge to Collect Metrics

Sep 12, 2024 By Bill Chung In Cribl

This is one of a series of blog posts that explain how the Cribl SRE team builds, optimizes, and operates a robust observability suite using Cribl’s products. If you haven’t already, we encourage you to read the previous blog about how the Cribl SRE team uses our own products to achieve scalable observability. We installed Cribl Edge on the machines we manage for our users and use it to gather metrics.

PromQL Cheat Sheet: Must-Know PromQL Queries

Sep 12, 2024 By Prathamesh Sonpatki In Last9

This cheat sheet provides practical guidance for diagnosing issues and understanding trends.

PromCon 2024 - Day 1

Sep 12, 2024 By Prathamesh Sonpatki In Last9

Get a quick overview of Day 1 at PromCon 2024, which featured significant announcements on Prometheus 3.0 and OpenTelemetry compatibility.

When Alerts Don't Mean Downtime - Preventing SRE Fatigue

Sep 12, 2024 By Hrishikesh Barua In IncidentHub

A recent question in an SRE forum triggered this train of thought. I've paraphrased the question to reflect its essence. There is plenty to unravel here. My first reaction to this question was that the SRE who posted this is in a difficult place with systemic issues.

The Role of Technology in Enhancing Incident Response Call Etiquette

Sep 11, 2024 By Vishal Padghan In Squadcast

The interconnectedness of today's business environment has significantly heightened the complexity of incident response (IR). The need for immediate action, precise communication, and real-time collaboration is more critical than ever. However, beyond the technical precision required in solving problems, there lies an often overlooked aspect of effective IR management: the etiquette of incident response calls.

OpenTelemetry Protocol (OTLP): A Comprehensive Guide to Modern Observability

Sep 11, 2024 By Gabriel Diaz In Last9

Learn about OTLP’s key features and how it simplifies telemetry data handling, and get practical tips for implementation.

Streaming Aggregation: Real-Time Data Processing in 2024

Sep 11, 2024 By Anjali Udasi In Last9

We break down the essentials of streaming aggregation and its impact on modern data processing.

Tutorial 9 - Incident Responders

Sep 11, 2024 By Zenduty In Zenduty

Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

View Video

Tutorial 10 - Incident Roles

Sep 11, 2024 By Zenduty In Zenduty

Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

View Video

How to deploy a Slack bot to allow anyone in your team to quickly raise major incidents on Zenduty

Sep 9, 2024 By Vishwa Krishnakumar In Zenduty

One of the biggest challenges for some of our customers was allowing non-engineering teams, such as Support, Sales, or Customer Success teams, to raise incidents for specific Dev/Infra/Security/Ops teams on Zenduty in a structured and efficient manner as soon as a customer reports an issue. In many organizations, we observed that non-technical team members often needed to switch between platforms, fill out complex forms, or reach out to multiple stakeholders manually to ensure that an issue was escalated.

Microservices Monitoring with the RED Method: A Developer's Guide

Sep 6, 2024 By Prathamesh Sonpatki In Last9

This blog introduces the RED method, an approach that simplifies microservices monitoring by homing in on requests, errors, and latency.

kube-state-metrics: Your Complete Guide to Simplifying Kubernetes Observability

Sep 5, 2024 By Prathamesh Sonpatki In Last9

This guide provides an in-depth look at the setup and usage of kube-state-metrics, helping you monitor and manage your Kubernetes clusters more efficiently.

Burn rate is a better error rate

Sep 4, 2024 By James Frullo In Datadog

While building our Service Level Objectives (SLO) product, our team at Datadog often needs to consider how error budget and burn rate work in practice. Although error budgets and burn rates are discussed in foundational sources such as Google’s Site Reliability Workbook, for many these terms remain ambiguous. Is an error budget a static quantity or a varying percentage? Does burn rate indicate how fast I’m spending a fixed quantity, or is it just another way to express error rate?

PromQL: A Developer's Guide to Prometheus Query Language

Sep 4, 2024 By Gabriel Diaz In Last9

Our developer’s guide breaks down Prometheus Query Language in an easy-to-understand way, helping you monitor and analyze your metrics like a pro.

Instrumenting fasthttp with OpenTelemetry: A Comprehensive Guide

Sep 4, 2024 By Tushar Choudhari In Last9

We cover everything from initial setup to practical tips for monitoring and improving your fasthttp applications. Follow along to enhance your observability and get a clearer view of your app’s performance.

Top Features to Look for in Enterprise Incident Management Software

Sep 3, 2024 By Spandan Pal In Squadcast

Are you tired of dealing with unexpected system crashes and the chaos they bring? You're not alone. For enterprise SREs, DevOps, and IT Operations teams, mastering incident management goes beyond just fixing problems; it’s about preventing them. According to a recent report, incident volume within enterprise companies rose by 16% during 2023, highlighting the growing complexity and risk in digital operations. This underscores the urgent need for robust incident management solutions.

Introducing Statusy - An Open Source Status Page Aggregator

Sep 3, 2024 By Squadcast In Squadcast

A quick walkthrough of Statusy, an open-source status page aggregator that centralizes service monitoring for your team. Created by Yash Jain at Squadcast, Statusy simplifies tracking with a unified dashboard and flexible notifications. Set up in minutes and keep your team informed!

View Video

PromQL for Beginners: Getting Started with Prometheus Query Language

Sep 3, 2024 By Gabriel Diaz In Last9

New to Prometheus? My PromQL beginner's guide teaches you how to write queries, understand data types, and use key functions.

OpenTelemetry Filelog Receiver: Collecting Logs from Kubernetes

Sep 2, 2024 By Prathamesh Sonpatki In Last9

Master log collection in Kubernetes with OpenTelemetry's filelog receiver. Learn to configure, optimize, and troubleshoot log collection from various sources including syslog and application logs. Discover advanced parser operator techniques for robust observability.
