Alerting

5 Best Practices for Resolving Errors Quickly

I love writing software, but I hate dealing with bugs. They take you away from what you want to be doing and often lead you down a rabbit hole. At Sentry, an open-source error tracking platform that provides complete app logic, deep context, and visibility across the entire stack in real time, we have a few tips that we’ve honed over time to make error resolution painless (OK, less painful), including an official integration with PagerDuty.

Improving Hospital Workflow with OnPage Alerting

At many points in a hospital’s functioning, workflow touches the outcome. The problem facing much of healthcare, though, is that the established workflow for alerting and messaging physicians is broken. What are some ways to improve physician scheduling? What are the potential impacts of that improvement?

Incidents as we Imagine Them Versus How They Actually Are with John Allspaw

There is a tendency to imagine (or remember!) incidents as unfolding in a much neater and more orderly way than they actually do. Events can leave some engineers scratching their heads about what is happening, while their teammates are instead confused about how it's happening.

Real-Time Operations Maturity: How Businesses Can Thrive in the Digital Era

It’s rare to find a business today that doesn’t rely on digital technologies and services. Retail is one example: Whether customers are buying online or in store, completing a transaction requires a website or point-of-sale system. The entire supply chain relies on IT services to deliver goods on time and to the right locations. And just like at any company today, every department, from development and marketing to HR and business services, has a critical tech stack.

Machine Learning in IT & Digital Operations: Why Now, And What to Keep in Mind

You’ve just recovered from a critical application outage and your team is being asked to report on root cause and recommended remediation steps later this afternoon. Can you quickly analyze all the data, identify all the leading events, and discern which one was responsible for the cascading failure?