Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

OpenTelemetry, AI, and the Future of Observability with Andreas Grabner

Shubham Srivastava from our team had the pleasure of meeting Andreas Grabner at KubeCon + CloudNativeCon Europe earlier this year. Andreas wears many hats in his daily work, primarily serving as a DevOps Activist at Dynatrace, where he has dedicated over 16 years to shape the Observability solutions we see today. He is also a Developer Advocate at Keptn – helping teams automate and orchestrate their deployments end-to-end and plays an active role as an Ambassador in the CNCF community.

Why First-Call Resolution Is Non-Negotiable in Modern Business

In 1750 BCE, in the bustling heart of ancient Mesopotamia, a copper merchant named Ea-nāṣir thought he had closed another routine sale of copper ingots. Little did he know, his customer wasn't exactly thrilled. In fact, the customer was so displeased that he decided to write Ea-nāṣir a strongly worded letter. Yes, you heard that right! A literal stone tablet of dissatisfaction, complaining about the shoddy grade of copper and some other delivery mishap.

MTBF, MTTR, MTTF, MTTA: Incident Metrics Explained

When it comes to managing incidents and ensuring operational efficiency, understanding key metrics is crucial. Among the most important are MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), MTTF (Mean Time To Failure), and MTTA (Mean Time To Acknowledge). In this blog, we'll explore these metrics along with some best practices and practical applications.

Four Golden Signals: Key Indicators for System Reliability

System reliability is crucial for providing seamless user experiences and enabling effective business operations. The "4 Golden Signals" —latency, traffic, errors, and saturation—offer a comprehensive view of system performance and potential issues. In this blog, we deep dive into system reliability and explore these four key metrics for monitoring system health and ensuring optimal performance.

The Reliability Stories You Won't Hear on LinkedIn

We had the pleasure of meeting Ponmani Palanisamy, a Staff Site Reliability Engineer at LinkedIn, at a recent SRE Meetup in Bangalore. Ponmani gave an insightful talk on "Improving data redundancy and rebalancing data in HDFS." We were captivated by his talk and eager to learn more about his experience in the reliability space. We talked about everything including his journey, experiences, and of course, his most memorable war room stories over a steady career of 17 years. Here's what he had to share.

KPI vs. SLA: Important Metrics in Incident Management

Organizations prioritize Key Performance Indicators (KPIs) and Service Level Agreements (SLAs) to achieve optimal performance. However, understanding the differences between KPIs and SLAs can be challenging. In this blog, we discuss everything about Key Performance Indicators (KPIs), Service Level Agreements (SLAs), and the key differences between KPIs vs SLAs.

Insights of an Observability Advocate: The Challenges and Rewards

At a recent SRE Meetup in Bangalore, we had the pleasure of meeting Akshay Deshpande. During our conversation, Akshay, who manages a Performance/Observability Engineering team at Smarsh discussed his passion for observability and his constant drive to improve the field. Smarsh helps companies gain valuable insights from their communication data, enabling them to proactively identify potential regulatory and reputational risks before they escalate.

What Is Denormalized Data?

Traditional database design prioritizes data integrity through normalization. However, for read-heavy workloads, normalized data structures can lead to complex queries and slower performance. Denormalization offers an alternative approach to optimize query execution and improve efficiency. A study concluded that denormalization can improve query performance when implemented with a thorough understanding of application requirements.

Bob Lee - Lead DevOps Engineer at Twingate

I was out there in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and Site Reliability. I had the privilege of meeting Bob Lee, who currently leads DevOps at Twingate — a cloud-based service that provides secured remote access, and poised to replace VPNs.

Strategies for Scaling Systems Reliably by Bob Lee

I was out there in sunny Austin this February, speaking at Civo Navigate 2024. The event was jam packed with amazing talks, and it was great meeting so many people with long and fascinating careers in engineering and Site Reliability. I had the privilege of meeting Bob Lee, who currently leads DevOps at Twingate — a cloud-based service that provides secured remote access, and poised to replace VPNs.