Operations | Monitoring | ITSM | DevOps | Cloud

Centrally set up and scale monitoring of your infrastructure and apps with Datadog Fleet Automation

Setting up and scaling observability across large, distributed environments often requires platform and SRE teams to coordinate access to infrastructure hosts and switch between configuration management tools and product-specific documentation. These tasks increase setup time and create delays in establishing visibility of critical services in Datadog. As teams expand their infrastructure, they need to coordinate Datadog configuration changes in a consistent and auditable way.

Intelligent Agents vs. Intelligent Attackers: The New Threat Detection Paradigm

Most security stacks only move when told to. They wait for known IOCs, hunt for pre-defined suspicious strings, and trigger automation only after a condition lights up. By then, attackers have already pivoted. Agentic AI rewrites the rules. Instead of signature-based detection, it monitors behavioral baselines and identity signals, watching for violations of expectations formed from observed context.

Confessions of a software engineer who enjoyed being paged at 5am

It’s 5:14am, and I wake up to the squawking geese sound of my PagerDuty alert (anyone else have this sound? No?). I’m four months into working for my new team as a junior software engineer, and this is my first time being paged in the middle of the night. Most software engineers probably dread this moment, but I kind of love it. Agile ceremonies and Jira tickets suddenly don’t matter, and you’re fully focussed on stopping a customer-impacting fire.

Gartner I&O and Cloud Strategies Conference 2025: From Observability to Outcome-Driven Operations

This year’s Gartner IT Infrastructure, Operations and Cloud Strategies Conference made one thing abundantly clear: the industry is moving beyond reactive monitoring and isolated dashboards toward autonomous, outcome-driven IT operations. While AI and agentic automation dominated keynotes and vendor messaging, conversations on the show floor reflected a more grounded reality.

Elastic at AWS re:Invent: Concluding a year of partnership in agentic AI innovation

Highlights of another laudable year of customer-centric collaboration The integration of Elastic’s capabilities, including vector databases and context engineering, with AWS services helps customers build intelligent, scalable, and secure applications faster and with greater flexibility. Our ongoing collaboration has resulted in another year of notable innovation with AWS. This blog highlights our continued collaboration with AWS throughout 2025 to help you capitalize on the power of AI.

Scaling Kubernetes GitOps with Fleet: Experiment Results and Lessons Learnt

Fleet, Rancher’s built-in GitOps engine, is designed to scale up to thousands of clusters. However, “how far” can it scale in a real world scenario, you might ask? Earlier this year, we wrote about the Fleet benchmark tool and we made a few discoveries that were very instructive, especially concerning resource consumption and its impact on deployments’ performances.

Application Monitoring 101: Queue Time Can Alert Before a Breakdown

Regular monitoring practices can emphasize application response time, but queue time is also often an early and important warning sign. If it rises, you’ll quickly see downstream effects: tail latency, timeouts, and error spikes. This means that this metric can give you a head start tackling app issues before they become user problems. In this post, we’ll discuss queue time, how things can go off track, and practical steps to turn it around.

Valkey JSON module now available on Aiven for Valkey

The Valkey JSON module implements native JSON data type support within Valkey, allowing users to efficiently store, query, and modify complex, nested JSON data structures directly. This overcomes previous architectural complexities, such as needing to serialize entire documents as strings or flatten data into hashes, by providing native handling for nested data models.

Closing the Year: What 2025 Taught Us About Resilience

By Doreen Jacobi, DERDACK / SIGNL4 It is that time of the year again. Time to reflect and look back at 2025. And I find myself thinking less about platforms and features – and more about the people behind them. The engineers who pick up the phone at 2 a.m. The operators who make judgment calls with incomplete information. The responders who keep systems running when everything feels urgent. If this year taught us anything, it’s this: technology can detect the problem, but people solve it.