The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Telemetry 101: An Introduction To Telemetry

Understanding system performance is critical for gaining a competitive advantage. Telemetry provides deeper insights into the system, helping business owners make better decisions. This article takes a comprehensive look at telemetry: how it works, the types of telemetry, what telemetry data can help you with, and the challenges that companies running telemetry systems might face.

Troubleshooting Common Kafka Conundrums

This is the third blog in our series on Kafka, where we continue to explore the nuances of deploying Kafka at scale. In our previous blogs, Essential Metrics for Kafka Performance Monitoring and Auto-Instrumenting OpenTelemetry for Kafka, we laid the foundation for understanding Kafka’s performance and monitoring aspects. Now, as we move further into the Kafka ecosystem, we tackle the common challenges that can arise during deployment and scaling.

Cloud Imperium Games moves ELK stack with ChaosSearch

Cloud Imperium Games (CIG) is a prominent video game development company known for its ambitious project, Star Citizen, which aims to be an open-world, massively multiplayer online space simulation game. As a result of the game's popularity, the metrics, events, and logs generated to track every action during gameplay have grown explosively in both volume and diversity (the latter a consequence of the dynamic, fast-paced development environment).

Demo of Internet Sonar: From Disruption to Instant Detection

Catchpoint's new Internet Sonar shows you global Internet status at a glance in an AI-powered, real-time, interactive dashboard and map. It answers the first question any IT team needs to ask when there's an outage: "Is it me, or is it something else?" In this recorded live demo session, leaders from our Product team walk you through how Internet Sonar works, how you can use it to lower MTTR, and how organizations are using it to save millions.

What is Zero Trust Reliability in engineering: Piyush Verma - The Reliability Podcast

The Reliability podcast speaks with engineers who have worked on large, complex systems to glean insights from what they have learned. What best practices should one adopt? What are the non-negotiable lessons for becoming better at a craft? What will ‘engineering’ look like with the advent of AI? We answer these questions and more by tracing the personal journeys of engineers who have built stellar careers decoding the innumerable intricacies of software engineering.

Production vs Local in engineering: Piyush Verma - The Reliability Podcast

The Link Between Early Detection and Internet Resilience: A Lesson from Salesforce's Outage

Almost every study examining the cost of outages reaches the same conclusion: outages are expensive. According to a 2016 study, the average cost of downtime was estimated at approximately $9,000 per minute. In a more recent study, 61% of respondents stated that outages cost them at least $100,000, with 32% indicating costs of at least $500,000 and 21% reporting expenses of at least $1 million per hour of downtime.