Latest News

Monitor Your ZFS Volume Manager With Telegraf

ZFS (Zettabyte File System) is a file system and volume manager with robust data integrity features: it checksums every block of data, so corruption is detected and, where a redundant copy exists, repaired. It also offers pooled storage, efficient snapshots and cloning, built-in data compression, deduplication, and high scalability, making it well suited to large-scale and high-performance storage environments.

Enhancing Git Management in Python Projects

Git is an essential tool for version control, whether you are a developer or an IT pro. Git allows engineers to track changes, collaborate, and manage their code effectively. However, for beginners, navigating Git can be daunting. Enter GitLens, a powerful Visual Studio Code (VS Code) extension designed to enhance Git capabilities and simplify Git management.

[New] Schedule Overrides is now live for every team member!

We are excited to announce a significant enhancement to our scheduling feature based on your valuable feedback! At Zenduty, we understand the importance of flexibility and efficiency in managing on-call schedules and ensuring seamless incident response. Previously, only team managers had the capability to edit schedules and add overrides. This meant that non-manager team members had to reach out to their managers to request override coverage, potentially delaying critical adjustments.

Going for gold: Testing the resilience of Olympic websites

As the world gears up for the Paris Olympics, it’s not just athletes who need to be in peak condition. This Olympics comes hot on the heels of the largest IT outage in history. Recovery efforts from the CrowdStrike outage are still ongoing. Lessons will be learned, no doubt, but at least one takeaway is already evident: the modern web is an oh-so-fragile thing; neglect digital resilience at your peril.

AKS Cost Optimization: How To Lower Your AKS Costs

Cloud-native applications continue to evolve and grow in complexity. And that complexity hurts the most when managing Kubernetes costs in Azure. AKS cost optimization may seem obvious, but it might also seem difficult to achieve. Microsoft’s fully managed Kubernetes service can help you run, manage, and deploy containerized applications. And while it optimizes performance, it can cause unexpected costs when improperly managed.

Monitor Amazon MemoryDB with Datadog

Amazon MemoryDB for Redis is a highly durable in-memory database service that uses cross-availability-zone data storage and fast failover, providing microsecond read times and single-digit-millisecond write times. Datadog’s integration for MemoryDB uses a range of metrics to provide important visibility into MemoryDB performance.

MongoDB use cases for the telecommunications industry

A trusted database is fundamental to the smooth and secure operation of telecommunications services, from network management and customer service to compliance and fraud prevention. MongoDB is one of the most widely used databases (DB Engines, 2024) for enterprises, including those in the telecommunications industry. It provides a sturdy, adaptable and trustworthy foundation. It also safeguards sensitive customer data while facilitating swift responses to rapidly evolving situations.

How our data team handles incidents

Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with a growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.

Leveraging AI for Efficient On-call Scheduling

Creating and maintaining a highly functional incident management process is crucial for organizations of every size and industry. Generative AI can significantly enhance the efficiency, accuracy, and speed of incident detection, analysis, and resolution. GenAI can be utilized across all stages of the incident management process, including preparation, response, communication, and learning.

How Network Observability Helps Lay the Foundation of Autonomous IT Operations

We often hear the term "observability" in the context of DevOps and how SREs use telemetry data. Collecting and analyzing this telemetry data is a vital first step to a successful autonomous IT operations strategy. Observability can help you find out about problems in your system you didn’t know you had—and before your users are impacted—by giving you new visibility that your monitoring systems don’t provide. But any observability initiative must also include network observability.