
The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Shorten your MTTR with Checkly Traces

We all know that Checkly is a ‘secret weapon’ for engineering teams who want to shorten their mean time to detection (MTTD). With Checkly, you can know within minutes if your service is unavailable to users or behaving unexpectedly. In this article, we’ll look at how Checkly Traces builds on those benefits, adding insights that help you diagnose root causes and further reduce your mean time to resolution (MTTR) for outages and other incidents.

AI Governance in 2025: A Full Perspective on Governance in Artificial Intelligence

In a world where artificial intelligence (AI) is leaping forward, growing at a CAGR of almost 36% from 2024 to 2030, questions about the governance and ethics of AI use are surfacing. As humans continue to develop AI systems, it is crucial to establish proper guidelines to ensure that powerful technologies like generative AI and adaptive AI are used responsibly.

Key metrics to monitor for optimal SQL Server performance

Microsoft SQL Server is a critical database component of many business applications, ensuring data integrity, fast query performance, and seamless transactions. However, maintaining peak performance requires proactive monitoring of essential metrics. In this blog, we’ll explore the key SQL Server performance metrics you should track and how they help prevent performance issues, optimize resource usage, and enhance database efficiency.

Challenges in Monitoring Applications That Use OAuth

OAuth (Open Authorization) has become a critical component in enabling secure third-party access to APIs, which makes it one of the most widely adopted authorization protocols for modern applications. From letting users sign in to apps with their Google or Facebook accounts to enabling third-party service integrations, OAuth simplifies the process of granting access to resources without compromising security.

What are Kubernetes audit logs and how to monitor them?

Security and compliance: Many industries, especially those governed by regulations such as HIPAA, PCI DSS, or GDPR, require detailed logs to demonstrate compliance and to trace security incidents.

Troubleshooting and forensic analysis: If something goes wrong, whether through an accidental configuration change or malicious activity, detailed logs help you diagnose the root cause and remediate it quickly.

Using Amazon RDS for high availability: How monitoring ensures reliable failover

Database downtime can lead to significant disruptions, revenue loss, and frustrated users. Amazon Relational Database Service (RDS) provides a managed database solution with high availability and automated failover to minimize such risks. However, continuous monitoring is crucial to ensuring reliable failover and minimizing downtime by detecting potential issues before they impact operations.

Managing Multiple Service Instances with a Systemd Generator

When working with systemd services in Linux, you might encounter situations where multiple instances of a service need to be managed dynamically. When I had to develop a solution to monitor multiple Kubernetes clusters with Icinga for Kubernetes, I ran into exactly this challenge.

Why Context Matters: Mastering Serverless App Monitoring

Hi there, and welcome to the second video in this series on observing AWS serverless applications with Datadog. In this video, you’ll learn how important it is to add custom business context to the telemetry you send to Datadog and how you can use that inside APM to quickly diagnose and debug issues. You’ll walk away with an understanding of the importance of distributed tracing, as well as how you can add specific business context to the telemetry you send.

Netdata vs. Prometheus: Which Monitoring Tool is Right for You?

Netdata's founder Costa Tsaousis built Netdata with performance and efficiency in mind. The result? 8x less RAM usage, 30x less disk I/O, 40x more data retention, 40x more data stored, and up to 22x faster queries—all thanks to our innovative tiered storage system, enabling ultra-efficient long-term queries.