Operations | Monitoring | ITSM | DevOps | Cloud

#036 - Beyond Kubernetes: A Radical Vision for the Future of Infrastructure with Adam Jacob (Syst...

Adam Jacob, CEO of System Initiative and original author of Chef, discusses the evolution of infrastructure automation and his career-long passion for infrastructure. Jacob reflects on the history and context of Chef, its emergence alongside EC2, and its role in configuration management. He shares insights into the competitive landscape of configuration management tools like Chef, Puppet, and Ansible, and touches upon the transition of Chef to Progress.

Microsoft Azure is Going Secure by Default. Are You Ready?

Developers aren't lazy – but sometimes cloud service defaults can be. Here’s what to look out for, and how Azure is changing the game. Let’s face it: Developers can sometimes be labeled as “laissez-faire” when it comes to security. But is that really fair? In reality, it’s not about being lax or lazy; it’s about the default configurations of many cloud services setting the security bar too low on initial deployment.

Slicing Up-and Iterating on-SLOs

One of the main pieces of advice about Service Level Objectives (SLOs) is that they should focus on the user experience. Invariably, this leads to people further down the stack asking, “But how do I make my work fit the users?”—to which the answer is to redefine what we mean by “user.” In the end, a user is anyone who uses whatever it is you’re measuring.

Ask a Partner About the Mistakes That Were Made

Did you hear about the failed video conferencing project? During a global stakeholder meeting, an IT team deployed new video conferencing software without testing a default echo feature. Every participant heard their voice echoing back, disrupting the meeting until someone identified and fixed the setting. While relatively minor, this incident demonstrates how overlooked technical details can compromise professional operations. Consider the failed ERP rollout by a food distributor.

How to do Agentless Monitoring with check_by_ssh

Check plugins are the foundation of Icinga 2. Icinga 2 executes them and maps their return values to either Host or Service objects. Everything else builds on top of that. These check plugins can come from the Monitoring Plugins project or be custom-written. Whatever their origin, they are the building blocks of an Icinga monitoring stack. If a plugin goes CRITICAL, Icinga 2 alerts the sysadmin.
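To make the idea of a plugin "return value" concrete, here is a minimal sketch of a custom check plugin in Python. It is not taken from the article. It assumes the widely used Monitoring Plugins convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN); the disk check, the function names and the 80/90 percent thresholds are illustrative choices only.

```python
import shutil
import sys

# Exit codes following the Monitoring Plugins convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin state (thresholds are illustrative)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (exit_code, output_line) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, so report UNKNOWN rather than CRITICAL.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_percent = 100.0 * usage.used / usage.total
    state = evaluate(used_percent)
    return state, f"DISK {LABELS[state]} - {used_percent:.1f}% used on {path}"


def main():
    """Print one status line and exit with the matching code."""
    code, line = check_disk()
    print(line)
    sys.exit(code)
```

Calling `main()` from the script entry point turns this into a runnable plugin: the monitoring system reads the single output line and the process exit code, which is the value that would be mapped to a Host or Service state.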

Manage All Your App Notifications in One Place with AppSignal

Alerts and notifications are the backbone of any Application Performance Monitoring (APM) tool, ensuring your team is immediately aware of critical issues. At AppSignal, we’re always improving our toolkit to help you stay ahead of problems before they impact performance or reliability. We've made huge improvements to how you can manage your app notifications and alerts with AppSignal.

Ex-Roblox SRE's take on SRE vs. DevOps

Former Roblox Sr. Engineering Manager Denys Pashutynski clarifies the fundamental difference between SRE and DevOps roles: SREs handle the customer-facing production edge while DevOps focuses on background automation. From The Incidentally Reliable podcast: real stories from the trenches of site reliability engineering. Made by SREs for SREs and hosted by Zenduty. Zenduty is a revolutionary incident management platform that gives you greater control and automation over the incident management lifecycle.

The One Thing Most Engineers Don't Understand (But Should)

How can engineering teams have a bigger impact on the bottom line? By thinking beyond code. Most engineers love to build and solve problems. But in a business, building for the sake of building isn’t enough. Even the cleanest code is just an expensive distraction if it doesn’t move the needle.

Diagnosing ActiveMQ broker performance issues with log analysis

Apache ActiveMQ is a widely used message broker that enables seamless communication between distributed applications. However, as the volume of messages increases, performance bottlenecks can arise, leading to slow message processing, high latency, broker crashes, and out-of-memory (OOM) errors. OOM errors are among the most critical of these issues: they occur when the broker exceeds its allocated heap memory, and they can result in service failures, message loss, and prolonged downtime.
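As a rough illustration of what log analysis for OOM errors can look like, the sketch below counts `java.lang.OutOfMemoryError` occurrences in broker log lines and groups them by the reason the JVM reports. It is not taken from the article: the sample log lines, their layout and the function name are hypothetical, and a real ActiveMQ log may be formatted differently.

```python
import re
from collections import Counter

# The JVM reports heap exhaustion as "java.lang.OutOfMemoryError: <reason>".
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<reason>.+))?")


def count_oom_errors(lines):
    """Count OutOfMemoryError occurrences in log lines, grouped by reason."""
    counts = Counter()
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            reason = (match.group("reason") or "unspecified").strip()
            counts[reason] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    "2025-01-01 10:00:00,000 | INFO  | Broker started",
    "2025-01-01 10:05:12,431 | ERROR | java.lang.OutOfMemoryError: Java heap space",
    "2025-01-01 10:07:45,002 | WARN  | Slow consumer detected",
    "2025-01-01 10:09:03,870 | ERROR | java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "2025-01-01 10:11:27,118 | ERROR | java.lang.OutOfMemoryError: Java heap space",
]

for reason, count in count_oom_errors(SAMPLE_LOG).most_common():
    print(f"{count:3d}  {reason}")
```

In practice the same function could be fed an open log file object instead of the sample list, since it only needs an iterable of lines.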

How to leverage AI to enhance network monitoring in retail: A CXO's guide

The retail industry has evolved into a mix of physical stores, e-commerce, digital payments, and omnichannel interactions. Now, GenAI has been added to this mix, which changes how people shop, how retailers operate, and how employees work. While this shift creates opportunities for retailers of all sizes, it also presents serious challenges in maintaining network performance and staying compliant with industry regulations.