
The latest news and information on cloud monitoring, security, and related technologies.

Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

Sustainable Cloud Computing in the UK: Challenges, Opportunities, and the Future

The tech industry's environmental impact is a growing concern, but can collaboration and innovation drive sustainability? At Civo Navigate London 2025, Regent Lee, Dinesh Majrekar, Liam McTague, and Simon Morris explored the challenges and opportunities of reducing emissions in the tech industry.

AWS Outage: How do you prepare for the failure of your own safety net?

When AWS’s massive outage struck, it didn’t just take down cloud services, apps, and enterprise platforms. It also knocked out many of the monitoring systems organizations depend on for real-time answers. Observability companies, including Datadog, New Relic, Checkly, Dynatrace, SpeedCurve, and Splunk Observability, lost visibility or functionality precisely when organizations needed them most.

Data Sovereignty in the Age of AI: A Conversation with Kelsey Hightower and Mark Boost

Join Kelsey Hightower and Mark Boost at Civo Navigate London as they discuss sovereignty in the context of AI and cloud computing. The conversation highlights the need for a more nuanced approach to cloud computing, one that balances the benefits of public cloud with the need for control and sovereignty. The discussion emphasizes the importance of open protocols and the role of the community in driving innovation, and notes that the adoption of AI workloads is driving a shift towards more decentralized and sovereign cloud architectures.

When AWS Goes Down: What It Means For Your Cloud Costs

A global outage at Amazon Web Services (AWS) did more than knock popular apps offline. It laid bare the cost risks embedded in many cloud architectures. As services fail, the hidden costs of high availability, from redundancy planning to recovery operations, often multiply. For cloud cost leaders, this isn’t an issue of uptime; it’s a visibility and budget-shock issue. It’s a key reminder that architecting for resilience involves difficult trade-offs.

PagerDuty Joins AWS QuickSuite: Connect Your Incident Management with 1,000+ Applications

Today, we’re announcing that PagerDuty is now available in AWS QuickSuite through the Model Context Protocol (MCP). This means PagerDuty’s incident management capabilities can now connect with the 1,000+ applications and data sources that QuickSuite integrates with, from AWS services to enterprise SaaS platforms, all accessible through natural language.

Kubernetes Security Guide: Risks, Strategies, And Tools

In 2018, attackers gained access to Tesla’s AWS cloud environment through an unprotected Kubernetes admin console. Because it lacked proper authentication, the attackers could see and control cluster resources. Once inside, they deployed new pods running cryptocurrency mining software, using Tesla’s compute power for profit. During the breach, the attackers also uncovered credentials stored in the cluster.

25 Sumo Logic updates to better monitor and secure your Azure environments

If you manage workloads across multiple clouds, you know how easy it is for critical alerts or performance issues to get lost in the noise. Switching between consoles, correlating logs, and tracking metrics across platforms can slow down troubleshooting, delaying incident resolution and increasing the risk of missing critical alerts.