The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Why Quality Matters: A Conversation with NDepend

In this episode of Founder & Friends, John-Daniel Trask, co-founder and CEO of Raygun, sits down with Patrick Smacchia, Founder and CEO of NDepend, to share their stories and strategies for building excellent software. They discuss the intricacies of the .NET ecosystem, strategies for sustaining high-quality software, and the evolution of development tools. Gain insights into NDepend's methods for managing dependencies, refining code, and optimizing performance. This episode is essential for developers aspiring to advance their technical abilities and produce superior software.

Flaky tests: their hidden costs and how to address flaky behavior

Flaky tests are bad—this is a fact implicitly understood by developers, platform and DevOps engineers, and SREs alike. When tests flake (i.e., generate conflicting results across test runs, without any changes to the code or test), they can arbitrarily fail builds, requiring developers to re-run the test or the full pipeline. This process can take hours—especially for large or monolithic repositories—and slow down the software delivery cycle.
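The definition above (conflicting results across runs with no change to the code or test) can be turned into a simple check over CI history. The following is a minimal sketch, not the article's method: the function name `find_flaky_tests` and the `(test_name, commit_sha, passed)` record shape are assumptions made for illustration, and the commit SHA stands in for "no changes to the code or test".

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the names of tests that both passed and failed on the same commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    This record shape is a hypothetical one chosen for this sketch.
    """
    # Collect the set of distinct outcomes seen for each (test, commit) pair.
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))

    # A test is flagged only if one unchanged commit produced both outcomes.
    flaky = {name for (name, _sha), seen in outcomes.items() if len(seen) > 1}
    return sorted(flaky)


if __name__ == "__main__":
    history = [
        ("test_login", "abc123", True),
        ("test_login", "abc123", False),   # same commit, different result
        ("test_export", "abc123", True),
        ("test_export", "def456", False),  # result changed with the code
    ]
    print(find_flaky_tests(history))
```

Grouping by commit keeps a genuine regression (a test that starts failing after a code change) from being reported as flaky.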

Beyond Their Intended Scope: Uzing into Russia

The first installment of our new blog series, Beyond Their Intended Scope, covers BGP mishaps that may have escaped the community’s attention but are worthy of analysis. In this post, we review a recent BGP leak that redirected internet traffic through Russia and Central Asia as a result of a path error leak by Uztelecom, the incumbent service provider of Uzbekistan.

Key Metrics to Monitor for a Healthy Kafka Cluster

Maintaining a healthy Kafka cluster is critical to ensuring your real-time data pipelines run smoothly. However, keeping your Kafka environment in tip-top shape isn’t just about setting it up and letting it run. Regular monitoring of key metrics is essential to catch issues before they escalate, optimize performance, and keep everything humming along. So, what should we be looking at when it comes to Kafka metrics? Let’s break down the most important ones and how to interpret them.

AWS X-Ray vs Jaeger - Choosing the Right Distributed Tracing Tool

Distributed tracing has become an essential part of any application's performance monitoring strategy. As businesses adopt distributed architectures, choosing the right tracing tool is crucial for efficient troubleshooting and performance monitoring. The two most prominent choices are AWS X-Ray and Jaeger, each offering unique features and advantages. AWS X-Ray, a managed service by Amazon, simplifies tracing for applications running on AWS.

Infrastructure Monitoring Checklist: What you should monitor

Planning to monitor your infrastructure? Monitoring is essential to ensure system stability, security, and optimal performance. Without proper monitoring, small issues can quickly escalate into major problems that affect productivity and service availability. No single checklist fits every setup, but some key areas are worth considering when you build a monitoring strategy for your own environment.

Determining a CoPE's Efficacy - and Everything After

As discussed in the first article in this series, a Center of Production Excellence (CoPE) is a more or less formal, provisional subsystem within an organization. Its purpose is to act from within to change that organization so that it’s more capable of achieving production excellence. The series has, to date, focused mainly on how best to construct such a subsystem and what activities it should pursue.

12 Benefits You Get by Scaling with Netdata

80% of decision-makers globally acknowledge that digital infrastructure is essential for reaching business goals. However, IT infrastructure is becoming increasingly distributed and complex. Organizations are managing hundreds—even thousands—of nodes across cloud, on-premise, and edge environments. This predicament makes effective monitoring across all systems more essential than ever.