Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Scale Chaos Engineering with Automation and AI

Chaos Engineering and Fault Injection testing have been proven to prevent outages, increase availability, and help companies avoid costly downtime. But without the right processes or tools, they require specialized knowledge, a deep understanding of systems, and manual effort for every test. To fully realize the benefits of Chaos Engineering, testing needs to be adopted across all engineering teams without demanding a heavy lift or an investment that takes away from roadmap progress.

C15 Roadmap & Release 22

We’re excited to launch Release 22, our most advanced update yet. It delivers smarter controls, deeper customization, and long-term reliability. Key improvements include enhanced handling of TTY messages with Wireshark support, flexible call history recording, new STIR/SHAKEN override options for better traceability, and real-time call limit tracking with an upgraded interface. Plus, starting March 25, 2026, SIP code 603+ will notify callers when calls are blocked due to analytics, in line with FCC regulations.

Enhanced Flexibility and Security Monitoring - New in DataStream

This update delivers significant advances in operational flexibility and security monitoring capabilities. It addresses the evolving needs of security teams across diverse deployment environments, from air-gapped networks to those prioritizing automation and simplicity, while expanding integration options and improving visibility into data flows.

Fix flaky tests in your sleep with Chunk by CircleCI

A test fails. You rerun it and it passes. You shrug and move on. This is how most teams deal with flaky tests. The “rerun until green” approach works in the moment, and rerunning from failed tests is a useful way to confirm whether a failure is real. But reruns don’t fix the underlying issue. Over time, they burn CI resources and can hide real instability in your code. On the other hand, fixing flaky tests can mean hours of work.
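To make the "fails, then passes on rerun" pattern concrete, here is a minimal sketch that labels a test from its rerun history on a single commit. The function name `classify` and the three labels are assumptions chosen for illustration; this is not how Chunk or CircleCI detects flaky tests.

```python
def classify(attempts):
    """Label a test from its pass/fail history on the same commit.

    `attempts` is a list of booleans in run order, where True means the
    attempt passed. The labels are illustrative, not a CircleCI convention.
    """
    if not attempts:
        raise ValueError("need at least one attempt")
    if all(attempts):
        return "stable"   # every attempt passed
    if not any(attempts):
        return "failing"  # every attempt failed: likely a real failure
    return "flaky"        # mixed results with no code change in between
```

A test that fails once and then passes on rerun, `classify([False, True])`, comes back as flaky, which is the case that "rerun until green" quietly hides.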

What Is Business Continuity?

A single outage can stop operations, affect customers, and damage trust. In a world of pandemics, cyberattacks, weather events, and supply chain delays, your team cannot simply hope that nothing breaks. Business continuity helps your team stay ready, recover faster, and keep downtime low. In this blog, we’ll explain what business continuity means, how to create a solid business continuity plan, and which approaches help teams stay operational during a disruption.

Simple Talk Podcast - Coffee Chat with Lee Brownhill

Steve sits down with Lee Brownhill, who by day helps clients optimize their SQL workloads in Azure and AWS at Cloud Rede, but is also a Redgate Ambassador, blogger and aspiring speaker. Lee talks about his interest in giving back to the SQL Server community through writing and speaking, having taken inspiration from others online and in-person at events, and naturally the conversation also touches upon AI, the cloud, and more.

What Is Incident Response Lifecycle?

The Incident Response Lifecycle is a step-by-step process that helps engineering teams detect, respond to, and recover from unexpected system disruptions or outages. It includes a series of six practical stages: Detection, Analysis, Impact Mitigation, Incident Resolution, Service Restoration, and Post-Incident Analysis. By following this lifecycle, teams can minimize downtime, reduce business impact, and continuously strengthen system reliability.
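The six stages named above can be modeled as a simple ordered sequence. This is a minimal sketch, assuming the stages run strictly in the order listed; the names `Stage`, `next_stage`, and `walk_lifecycle` are hypothetical and not taken from any specific incident-management tool.

```python
from enum import Enum


class Stage(Enum):
    """The six lifecycle stages, numbered in the order the text lists them."""
    DETECTION = 1
    ANALYSIS = 2
    IMPACT_MITIGATION = 3
    INCIDENT_RESOLUTION = 4
    SERVICE_RESTORATION = 5
    POST_INCIDENT_ANALYSIS = 6


def next_stage(current):
    """Return the stage that follows `current`, or None after the last one."""
    if current is Stage.POST_INCIDENT_ANALYSIS:
        return None
    return Stage(current.value + 1)


def walk_lifecycle():
    """Return the stage names from first to last by following next_stage."""
    names = []
    stage = Stage.DETECTION
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

Real incidents often loop back, for example from mitigation to further analysis, so a production model would likely allow more transitions than this strictly linear sketch.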

Why your Kubernetes clusters and GPUs should live under one roof

The world remains abuzz with AI hype, but the reality is that most modern applications aren’t purely AI workloads. The average company will have web services, APIs, databases, and background jobs running alongside its machine learning inference or training components. An architecture question everyone faces: should your Kubernetes cluster and GPU compute live in the same data center, or can you split them across providers?