The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

February product updates. Anomaly detection, incident management and better UX

The past couple of months were a bit hectic, and there’s a good reason for that. We set out to create a better experience for our users, and that’s exactly what we did! Besides making a lot of quality-of-life changes, we’ve introduced new features that we think will speed up how quickly you debug your applications and give you a whole new perspective on all things AWS Lambda.

IT Ops reporting is broken. BigPanda Unified Analytics can help

Your IT Ops execs and your service owners want easy-to-understand reports on application and service uptime and performance, IT Ops and NOC team performance, and incidents by source, severity and other parameters. To produce them, your IT Ops team is probably wasting precious hours every week (hours that your IT Ops team doesn’t have!) wrangling spreadsheets and general-purpose reporting tools that are hard to use and update. BigPanda Unified Analytics can change all of that.

Postmortems Part 2: How to Adopt a Learning Culture

Culture is the way we do things together. It’s the secret sauce that results in happy, healthy teams that consistently meet their goals. It’s also the hardest thing to define, cultivate, and change in an organization. True cultural change requires more than creating and communicating policies. It takes collaboration, persistence, and experimentation.

How to Take Business Continuity Tests To The Next Level

The importance of effective business continuity planning (BCP) cannot be overstated. Being able to avoid and mitigate the risks and damages associated with a disruption to operations is critical to the health of any business. And the two main pillars upon which a robust BCP program rests are, of course, the plan and the testing program.

Introducing The PagerDuty Postmortem Guide

Your team had been fighting this major incident for hours, but your investigation was hitting one dead end after another. Finally, you managed to isolate the problem and your graphs started to improve. When all systems went back to normal, everyone let out a collective sigh of relief, shut down the response call, and went back to bed, never to think of this incident again. Or so you thought.