Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Observability and incident response need resilience testing

There’s a reason why observability and incident response practices have become standard across modern software development. Anyone wanting to minimize downtime and deliver reliable, available applications needs to have fully instrumented systems and playbooks so they can respond quickly and effectively to outages or incidents. But there’s another piece to the reliability puzzle: resilience testing.
Sponsored Post

All-in-One Incident Management: Why Squadcast Trumps Separate On-Call and Alerting Tools

The pressure is on. Incidents happen, and resolving them quickly and efficiently is crucial for meeting your SLAs. But relying on a patchwork of tools for alerting, collaboration, and post-incident analysis can create confusion, delays, and frustration. These tools may work, or may have been working perfectly, in your company, but there are a few factors to consider, and the list of questions will differ from organization to organization. Such factors can help you evaluate whether your current tools are truly effective for incident response, or if it's time to switch to a unified solution like Squadcast.

Streamline Your Development Workflow with Bunnyshell: Achieve Faster Time-to-Market

In today’s fast-paced software development landscape, maintaining consistent and reliable environments across all stages—whether it’s development, testing, or production—is crucial. The "works on my machine" problem is all too familiar, leading to inefficiencies and delays that can derail your projects. Enter Bunnyshell, a game-changer in the world of environment management that can transform your development workflow and drastically accelerate your journey from code to production.

Customer-impacting incidents increased by 43% during the past year; each incident costs nearly $800,000

PagerDuty, Inc. releases a study of 500 IT leaders and decision-makers responsible for IT operations at companies with more than 1,000 employees in the United States, the United Kingdom, and Australia. The study highlights the growing impact of customer-facing incidents and the ways automation can help mitigate them.

Managed Apps on Public Cloud: Why Operations Matter, Part I

You might be tempted to think that running an app on a public cloud means you don’t need to maintain it. While that would be wonderful, it would require help from the public cloud providers and app developers themselves, and possibly a range of mythological creatures with magic powers. This is because any app, regardless of the infrastructure on which it runs or its output, requires maintenance in order to yield accurate and reliable outputs.

Monitoring as Code and Checkly Listed in the Gartner Hype Cycle for the Second Consecutive Year

I'm excited to share that Gartner has again included Monitoring as Code (MaC) as an emerging practice in its Hype Cycles for SREs, for the second year in a row. Since we founded Checkly, our vision has been that monitoring should sit in your repository, be codified, and scale with your software development. There is no alternative to MaC, as it allows your engineering team(s) to work together, create and maintain checks, and ultimately own their monitoring.

Reliability-Driven Fleet Management with Komodor

Maintaining a few K8s clusters is hard enough. Maintaining 1000+ clusters is virtually impossible without embracing new tooling and paradigm shifts. Join us for an insightful LIVE workshop exploring the possibilities of Kubernetes Fleet Management with Komodor, led by Itiel Shwartz. In this session, we will dive into the challenges of multi-cluster management and how Komodor's comprehensive platform simplifies operations. Discover how to gain real-time visibility into your clusters, automate routine tasks, and troubleshoot issues across your entire fleet efficiently.