How AI-powered anomaly detection is transforming APM for SREs

Site reliability engineers (SREs) often face challenges in keeping an organization’s sites running smoothly as the complexity of distributed systems steadily increases. With the rise of microservices, cloud-native architectures, and massive data volumes, manual monitoring and troubleshooting are no longer sustainable. SREs must navigate hurdles like alert fatigue, incident response delays, and the constant pressure to maintain system reliability.

Getting Started with M365 dashboards

SquaredUp is a flexible dashboard and analytics platform that makes it easy to build dashboards for your M365 and Intune usage and analytics, which you can then use for monitoring or sharing. In this article we’ll take a look at getting started with the M365 plugin for SquaredUp and building our first dashboard. Sign up for a free account if you’d like to follow along.

Petabyte Scale, Gigabyte Costs: Mezmo's Evolution from ElasticSearch to Quickwit

At Mezmo, we handle an enormous volume of telemetry data for our customers and ourselves, requiring a robust and efficient search and analytics backend. For years, ElasticSearch served us well, but as our infrastructure grew to a multi-cluster, multi-petabyte scale, we started to see the cracks—rising costs, performance bottlenecks, and scalability concerns. We needed a change, one that would make our system more cost-effective while maintaining speed and reliability.

Kubernetes Monitoring and Alerting Made Easy with Splunk Observability Cloud and OpenTelemetry

In this video, I'll show you how to quickly set up monitoring and alerting for your Kubernetes clusters using Splunk Observability Cloud. We’ll start by deploying the Splunk OpenTelemetry Collector using Helm, and then use the Kubernetes Navigator inside Splunk Observability Cloud to view the health of our cluster and the applications it’s hosting. I’ll demonstrate AutoDetect detectors and alerts by intentionally triggering an issue in the cluster and walking through the alerting process. We’ll review the alerts in Splunk Observability Cloud and then resolve the issue in the cluster.

#035 - Beyond Kubernetes: A Veteran of the Container Wars on the Past, Present, and Future of Clo...

This episode of "Kubernetes for Humans" features Dan Ciruli, a Senior Director of Product Management at Nutanix, who shares his journey in tech and his perspective on the evolution of cloud-native technologies. Ciruli discusses his early career as an engineer and his transition to product management, noting that the role was not well-defined in the 1990s. He recounts his experiences with startups, Google, and D2iQ (formerly Mesosphere), highlighting the rise of Docker and projects like Mesos.

Locking Down PostgreSQL with SSL: Secure Remote Connections Like a Pro

PostgreSQL is a beast when it comes to handling data, but if you're running an instance that needs to be accessed remotely, securing it with SSL is non-negotiable. Without SSL, your database connection is essentially an open book for anyone snooping on the network. Let’s lock it down with properly signed certificates!

Monitoring coffee: Tales from Hosted Graphite's secret lab

It has been said that software engineers are organisms that convert caffeine into code. Not all software engineers need coffee to get by, but it's popular enough that it'd be silly for us not to have an office coffee machine... it'd also be sort of silly for a monitoring company not to monitor that coffee machine, which is so crucial that we could make a reasonable argument for it being part of the production infrastructure.

Love at First Ticket: Training IT on User Support

Technicians go through plenty of technical training, whether it's formal or on-the-job, but many overlook one of the most challenging parts of the job: talking to users. Unfortunately, users usually contact IT when something is wrong, which means they're stressed and tend to take that out on the people just trying to help. In this stream, we talked with a few IT leaders about how they train their team members to support users and make their organization fall in love with the IT team.