Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Ping Test for Network Connectivity: Simple How-To-Guide

Reliable network connectivity is paramount for uninterrupted communication and efficient data transmission. The ping test is a valuable tool to assess network connectivity, identify potential issues, and troubleshoot them effectively. If you're seeking to troubleshoot network issues or test connectivity between hosts, this comprehensive guide offers step-by-step instructions and valuable insights for performing an effective ping command test.

The "people problem" of incident management

Managing incidents is already tricky enough, and you want to get to mitigation as quickly as possible. But sometimes it feels like organizing everything surrounding an incident is more difficult than solving the actual technical problem and you end up getting delayed or sidetracked during mitigation efforts. We call that scenario the “people problem” of incident management.

Our lessons from the latest AWS us-east-1 outage

In case you missed it, AWS experienced an outage or "elevated error rates" on their AWS Lambda APIs in the us-east-1 region between 18:52 UTC and 20:15 UTC on June 13, 2023. If this sounds familiar, it's because it's almost a replay of what happened on December 7, 2021, although that outage was significantly more severe and took longer to restore.

Synthetic monitoring as Code with Checkly and ilert

This post will introduce Checkly, the synthetic monitoring solution, and their monitoring as code approach. This guest post was written by Hannes Lenke, the CEO, and co-founder of Checkly. ‍ First, thanks to Birol and the ilert team for the opportunity to introduce Checkly. ilert recently announced discontinuing its uptime monitoring feature and worked with us on an integration to ensure that existing customers could migrate seamlessly. ‍ So, what is monitoring as code and Checkly?

Top 5 Use Cases for Custom Fields on Incidents

Chasing down critical information in disparate systems of record while trying to resolve an incident can make an already stressful situation even more taxing. Extra clicks, extra logins, copy/paste, socializing that information with other responders–it all wastes time and introduces more room for human error. Now PagerDuty customers can use Custom Fields on Incidents to enrich their incident data.

Featured Post

The Top 5 Trends on SRE Leaders' Minds in 2023: Insights from a Seasoned Executive

I've spent most of my career trying to solve big problems for people. In the early days at New Relic, we were trying to help people scale their systems based without compromising on performance, cost, or the customer experience. Not an easy feat but we gave them a solution that allowed them to accomplish their goals. The key was religiously listening to our customers talk about their wants, needs, hopes and fears. While I am rarely the smartest person in the room, which my partner rarely misses a chance to lovingly remind me, I always do my best to listen to what the brilliant folks in my sphere are talking about.

New related incidents functionality brings order to the chaos of highly complex incidents

We’ve all been there. You’re working through some rather frustrating blockers during an incident only to discover that you don’t own the dependency at fault. Or, you’ve been pounding away at an issue when a fellow engineer reaches out and asks if your service is affected by some particularly gnarly database failure they’re seeing. But then what? Do you merge efforts and work in parallel or head for a coffee break while the issue gets attacked upstream?

Understanding Major Incident Management: Beginners Guide

A major incident represents a critical event that poses a real or potential threat to an information system's confidentiality, integrity, or availability. Major incidents can disrupt normal operations, impact your customers, and may compromise the security of sensitive data.

Kubernetes Simplified: Understanding its Inner Workings

Kubernetes has revolutionized the world of container orchestration, providing organizations with a powerful solution for deploying, managing, and scaling applications. However, the complexity of Kubernetes can be daunting for newcomers. In this blog, we will demystify Kubernetes by breaking down its core components, revealing its operational principles, and guiding you through the process of running a pod.

What is Zero Trust Security and Why Should You Care?

Automation has become a game changer for businesses seeking efficiency and scalability in a rather unclear and volatile macroeconomic landscape. Streamlining processes, improving productivity, and reducing incidence for human error are just a few benefits that automation brings. However, as organizations embrace automation, it’s crucial to ensure modern security measures are in place to protect these new and evolving assets.