
Latest News

Flaky tests: their hidden costs and how to address flaky behavior

Flaky tests are bad—this is a fact implicitly understood by developers, platform and DevOps engineers, and SREs alike. When tests flake (i.e., generate conflicting results across test runs, without any changes to the code or test), they can arbitrarily fail builds, requiring developers to re-run the test or the full pipeline. This process can take hours—especially for large or monolithic repositories—and slow down the software delivery cycle.

Beyond Their Intended Scope: Uzing into Russia

The first installment of our new blog series, Beyond Their Intended Scope, covers BGP mishaps that may have escaped the community’s attention but are worthy of analysis. In this post, we review a recent BGP leak that redirected internet traffic through Russia and Central Asia as a result of a path error leak by Uztelecom, the incumbent service provider of Uzbekistan.

Key Metrics to Monitor for a Healthy Kafka Cluster

Maintaining a healthy Kafka cluster is critical to ensuring your real-time data pipelines run smoothly. However, keeping your Kafka environment in good shape isn’t just about setting it up and letting it run. Regular monitoring of key metrics is essential to catch issues before they escalate, optimize performance, and keep the cluster stable. So which Kafka metrics should we be watching? Let’s break down the most important ones and how to interpret them.

AWS X-Ray vs Jaeger - Choosing the Right Distributed Tracing Tool

Distributed tracing has become an essential part of any application's performance monitoring strategy. As businesses adopt distributed architectures, choosing the right tracing tool is crucial for efficient troubleshooting. Two prominent choices are AWS X-Ray and Jaeger, each offering unique features and advantages. AWS X-Ray, a managed service by Amazon, simplifies tracing for applications running on AWS.

Infrastructure Monitoring Checklist: What you should monitor

Want to monitor your infrastructure? Monitoring is essential to ensure system stability, security, and optimal performance. Without proper monitoring, small issues can quickly escalate into major problems and affect productivity and service availability. There is no fixed checklist for infrastructure monitoring, since the details depend on your setup, but some key areas are worth considering when building a monitoring strategy that fits your environment.

Determining a CoPE's Efficacy - and Everything After

As discussed in the first article in this series, a Center of Production Excellence (CoPE) is a more or less formal, provisional subsystem within an organization. Its purpose is to act from within to change that organization so that it’s more capable of achieving production excellence. The series has, to date, focused mainly on how best to construct such a subsystem and what activities it should pursue.

12 Benefits You Get by Scaling with Netdata

80% of decision-makers globally acknowledge that digital infrastructure is essential for reaching business goals. However, IT infrastructure is becoming increasingly distributed and complex. Organizations are managing hundreds—even thousands—of nodes across cloud, on-premise, and edge environments. This predicament makes effective monitoring across all systems more essential than ever.

The Ultimate List of Incident Management Tools in 2024

Incident management tools are important for organizations to effectively handle service outages. With so many tools available, each with a different feature set, it's often difficult to find the one that fits your needs. In this article, we list the incident management software available in 2024, along with its features, to help you choose the right one.

RabbitMQ vs Kafka: Which Is Right for You?

For distributed systems and microservices, message brokers play an important role: they keep data flowing smoothly between different parts of our applications. Two names that often come up in discussions about message brokers are RabbitMQ and Kafka. But what exactly are they, and how do they differ?