Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Netdata Now Troubleshoots Your Alerts for You

The 2 AM pager alert. For anyone in Ops, SRE, or IT administration, those words trigger a familiar sense of dread. An alert has fired. Is it a real fire, or another false alarm waking you from a dead sleep? The pressure is on. Every minute of downtime costs money and reputation, but troubleshooting a complex system when you’re sleep-deprived is a Herculean task.

AI Agent Is Hitting Your APIs - Are You Ready?

It’s no longer theoretical – artificial intelligence has left research labs and entered production systems, generating a new breed of consumers – autonomous and intelligent agents. These autonomous AI agents are increasingly interacting with real-world APIs (application programming interfaces), which are sets of protocols and tools for building and integrating software applications.

Building your AI infra, our tips

Modular architecture: Decouple compute from storage so each can scale independently. This makes it easier to adapt to growing or shifting workloads over time. Future-ready hardware: Select GPUs and CPUs not just for current workloads but with an eye on scalability, including support for newer accelerator types. Scalable design: Ensure the system allows seamless addition of compute nodes or storage without a full redesign.

Running AI without blowing up your storage

Storage is often underestimated: In infrastructure discussions, compute and networking get most of the attention, while storage is treated as secondary. For AI workloads, that can be a costly oversight. Data throughput for specialized hardware: AI infrastructure powered by GPUs can process massive volumes of data at unprecedented speeds. This puts immense pressure on the storage system to keep up. Scale-out performance: An on-prem, scale-out, software-defined storage setup allows you to meet high performance demands, grow capacity as needed, and stay in control of infrastructure costs.

Bridging the Gap: 3 Practical Strategies to Align Security and Operations in DevOps

The gap between security operations and IT operations poses significant risk. It’s increasingly clear that DevOps leaders, IT managers, and enterprise teams face an uphill battle to manage growing threat complexity, endless patches, and compliance requirements while operating in silos. Bridging this gap is essential to effectively manage risks and enhance operational efficiency.

Securing the Invisible: Why Ambient AI Needs Next-Gen Security

If, like me, you’re continuously striving to keep pace with the ever-evolving world of artificial intelligence, you’re probably hearing a lot about how Ambient AI is poised to dominate discussions and developments throughout the second half of 2025. Ambient AI refers to artificial intelligence systems that operate unobtrusively in the background of our daily environments, constantly sensing, analyzing, and responding to various inputs without explicit human interaction.

Librato on Heroku is Going Away and Hosted Graphite Is the Better Next Step

Librato (a SolarWinds product) is being sunsetted summer of 2025, and that directly affects Heroku teams who’ve relied on the Librato add-on for “good enough” visibility into dynos, routers, and Postgres. If you’re in that group, you’ll need a replacement monitoring add-on that keeps you covered on Heroku and lets you grow beyond it without re-architecting how you ship metrics.

The strategic art of build vs. buy in software delivery ft. Tara Hernandez of MongoDB

Rob Zuber sits down with Tara Hernandez, VP of Developer Productivity at MongoDB and former Netscape engineer who helped create early continuous integration systems, to explore strategic frameworks for build vs. buy decisions in modern software delivery.

Jaeger Monitoring: Essential Metrics and Alerting for Production Tracing Systems

Your Jaeger setup is running. Traces are coming in, and the UI is helping you spot slow services or debug broken flows. But just like any part of your observability stack, Jaeger needs some basic monitoring to stay reliable. If the collector starts queueing spans or the agent runs out of buffer, it can lead to dropped traces, sometimes without any obvious sign in the UI. This blog focuses on the operational side of Jaeger.