The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Monitor Alcide kAudit logs with Datadog

Kubernetes audit logs contain detailed information about every request to the Kubernetes API server and are critical to detecting misconfigurations and vulnerabilities in your clusters. But because even a small Kubernetes environment can rapidly generate a large volume of audit logs, analyzing them manually is very difficult.

Introducing Prometheus-style alerting for Grafana Cloud

Hi! My name’s Richard Lam, and I’m the new product manager for Grafana Cloud. I’m really excited for my first contribution to this community, both so I can introduce myself to you all, and so I can highlight an awesome new Grafana Cloud feature that’s coming your way! Happy reading, and hopefully this is just the start of many more communications from me.

TrackJS for Node

TrackJS error monitoring, on your servers. We’re thrilled to announce official support for Node environments and the 1.0.0 release of our Node agent. We’ve actually had Node support since sometime last year, but we’re finally formalizing it as a first-class citizen and fully supported part of TrackJS! Here are some of the cool things you can do with TrackJS for Node.

Using Machine Learning for Root Cause Analysis

From a security breach to a complete system outage, when an incident occurs and your network or service is impacted, it’s typically the result of a chain of events. A problem with one service has impacted another service, and so on until finally, you’re facing a problem that’s compromising availability and damaging your customer experience. In the event of a serious incident, your team’s immediate response is to focus on identifying the root cause and restoring service.

How To Succeed When Adopting A Multi Cloud Environment

Today, the vast majority of companies work with multiple cloud providers. But moving IT operations to the cloud has significant consequences that they need to deal with. Discover how Broadcom helps customers manage critical workloads in multi-cloud environments, simplifying and accelerating the deployment of new business services.

How Automation Helps The Site Reliability Engineer

Automation has been with us for decades now, and after years of experience and experimentation, we are arriving at a best practice known as site reliability engineering. Site reliability engineering seeks to manage the risk imposed by multiple agile changes to protect business revenues and sustain positive customer experiences.

Find Where N+1 Database Queries Affect Your Application

One of Scout’s key features is its ability to quickly highlight N+1 queries in your application that you might not have been aware of, and then show you the exact line of code that you need to look at in order to fix it. In this video, we will use a Ruby on Rails application as an example, but the same concepts apply to other popular web frameworks.
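
For readers unfamiliar with the term, here is a minimal sketch of what an N+1 query pattern looks like. It is not taken from Scout or from the video: it uses Python's built-in sqlite3 module instead of Rails, and the tables, the `CountingConnection` helper, and the function names are all hypothetical, chosen only for illustration.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO posts VALUES (1, 1, 'First'), (2, 2, 'Second'), (3, 1, 'Third');
        """
    )
    return conn


class CountingConnection:
    """Hypothetical helper that counts how many queries are issued."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def titles_with_authors_n_plus_one(db):
    # One query for the posts, then one extra query per post: N+1 in total.
    posts = db.query("SELECT title, author_id FROM posts ORDER BY id")
    result = []
    for title, author_id in posts:
        [(name,)] = db.query("SELECT name FROM authors WHERE id = ?", (author_id,))
        result.append((title, name))
    return result


def titles_with_authors_joined(db):
    # A single JOIN fetches the same data in one round trip.
    return db.query(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    )


if __name__ == "__main__":
    slow = CountingConnection(make_db())
    fast = CountingConnection(make_db())
    titles_with_authors_n_plus_one(slow)
    titles_with_authors_joined(fast)
    print("N+1 version:", slow.count, "queries")
    print("JOIN version:", fast.count, "query")
```

With three posts, the loop version issues four queries while the JOIN version issues one. The gap grows linearly with the number of rows, which is why this pattern tends to go unnoticed on small development datasets.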