Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

What Are AI Workloads? Everything Ops Teams Need to Know

AI workloads break every assumption you have about infrastructure management. AI is everywhere. Machine learning-based tools are answering customer service questions, accelerating incident resolution, catching fraudulent transactions, spotting defects on production lines, and powering late-night searches that delve into the random topic that pops into your head right before bedtime. Behind every prediction, response, or generated sentence is massive computing power doing serious, continuous work.

AI Observability: How to Keep LLMs, RAG, and Agents Reliable in Production

AI observability closes the gap between “something’s wrong” and “here’s what to fix.” If you run AI in production, you have probably felt the whiplash. Yesterday, your LLM answered in 300 milliseconds (ms). Today, p99 latency crawls, costs spike, and nobody’s sure whether the culprit is model behavior, data freshness, or GPUs stuck at the ceiling. Dashboards light up, but they don’t tell you which issue puts customers at risk. That’s the gap AI observability closes.

Use OpenTelemetry with Observability Pipelines for vendor-neutral log collection and cost control

Today, many DevOps and security teams operate in complex, hybrid, or multi-vendor environments. As more teams adopt open standards to avoid lock-in, OpenTelemetry (OTel) is quickly gaining ground as the primary open source method for these teams to instrument and aggregate their telemetry data. However, OTel alone may lack the advanced processing functions, native volume control rules, and hybrid environment support that large organizations need.

Episode 1 - Preparing the workforce for AI | The Intelligent Enterprise

In the first podcast episode of The Intelligent Enterprise, Ricardo Costa, Senior Vice President and Chief Technology Officer at Purolator, shares his views on how to prepare the workforce for AI. In his role as a technology "translator" connecting business strategies with tech implementations, Ricardo highlights the importance of translating complex tech concepts into simple, understandable stories. He also addresses the leadership challenges of preparing the workforce for AI, including upskilling and ethical considerations.

Cloud Status Check Overview

In this video, we provide an overview of Uptime.com's Cloud Status check feature, designed to monitor the status of common cloud services within your technology stack. We walk you through the step-by-step process to configure a Cloud Status check, including how to select third-party services, add contacts, and organize checks with tags. Learn how to view incident history and get detailed updates from third-party providers. For more information, visit our documentation or contact our support team.

Search Telemetry Without Limits in a Multi Cloud and AI World

Cribl Search gives you one lens across all your telemetry data, no matter where it lives. Instead of forcing teams to move data into one system or jump between tools, you get a familiar pipe-based query experience with dashboarding and alerting built in. Storage and query processing stay separate, so you decide where your data lives while your users get fast, simple access in one place.

Introducing Logs, User Feedback, and more in the Sentry Godot SDK

With the first stable releases out of the gate, we’re happy to announce that Sentry’s Godot SDK is now ready for general use, supporting Windows, Linux, macOS, iOS, and Android. We started full-time development a year ago with just a few prototypes, and now it's finally here. Built on top of the mature Sentry platform SDKs, it comes as a GDExtension add-on that you can easily add to your Godot projects.

Inside the Cloudflare Outage: Real-World Data from UptimeRobot

On November 18th, 2025, a large Cloudflare outage broke big chunks of the internet. For several hours, users around the world were greeted with 500 errors on platforms like X, ChatGPT, Spotify, and many others that run behind Cloudflare’s network. At UptimeRobot, we sit in a slightly unusual spot during events like this. When Cloudflare has a bad day, we see it twice: once in the alerts we send to our customers, and again in how it affects parts of our own infrastructure.

What's New in InfluxDB 3.7: One-Click Monitoring, Faster Configuration, and Better Operational Clarity

InfluxDB 3.7 is now available for both Core and Enterprise, landing alongside version 1.5 of the InfluxDB 3 Explorer UI. This release focuses on giving developers faster visibility into what their system is doing with one-click monitoring, a streamlined installation pathway, and broader updates that simplify day-to-day operations. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.