
Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Alarming and Incident Reaction on Azure - An Architecture Guide for Enterprise Alert on Azure by Patrick Fontana

More and more companies are moving business-critical communication tools into cloud-based environments, hosted either in a partner datacenter or in a public cloud. The main deciding factors between these two options are trust in the provider and the cost of the solution.

Leadership and Innovation with Instacart's VP of Infrastructure

Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies. He is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

Adam Frank Demos Moogsoft Express June 24, 2020

As part of a live launch event, Adam Frank, Moogsoft's VP of Product and Design, demoed the latest AIOps & Observability solution for cloud-first companies: Moogsoft Express. Moogsoft Express helps DevOps and SREs detect app performance problems, keep software pipelines humming and honor customer SLAs — all while being extremely simple to use.

Promoting Continuous Learning with SRE

With the extreme changes we’ve all been through these last several months, it should come as no surprise that our jobs have changed drastically, too. We’re working remotely. We’re dealing with increased resource constraints. Our services are receiving more traffic than usual, and we’re tasked with keeping things up and running. Our work-as-done may not match what we did at the beginning of 2020.

Building Automated Monitoring with Icinga and iLert

How many servers can be managed by one system administrator? This question is hard to answer, since it depends heavily on the tasks that need to be performed. It is clear, however, that the number of servers one engineer can manage has increased tremendously over time and is still growing. Public and private clouds, in combination with automation tools, enable us to automate many daily tasks. In a modern IT infrastructure almost everything can, and should, be automated.

Sending Nagios alerts to Microsoft Teams and rapid incident response with Zenduty

Nagios is one of the most widely used open-source network monitoring tools, relied on by thousands of NOC teams globally to monitor the health of a vast array of hosts and services. Most teams use email as their primary Nagios alert notification channel, which means it may take the NOC team a few minutes to respond.

FYI: Email Alerting Isn't Enough

Email alerting is an inefficient way to receive and address critical alerts. Email inboxes tend to get flooded with “clutter,” and irrelevant messages bury urgent incident notifications. Incident management procedures require incident management systems to ensure that urgent issues are addressed immediately. Yet some services are reluctant to say goodbye to email alerting and its inefficiencies. This is the case with Google Voice, which recently solidified its commitment to email alerting.

Event Chaos or Enrichment? BigPanda's CTOs Can Help You Decide

In our recent “IT Ops Demystified – Event Chaos or Enrichment?” webinar, our field CTOs discussed how enrichment can help reduce operational costs by an order of magnitude. Here is a quick overview of what the webinar covers.