The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Monitor 3rd-party outages in PagerDuty

Aug 8, 2022 By isDown In isDown

We’ve integrated IsDown with PagerDuty so you can manage its alerts in the same place you manage all your other alerts. The PagerDuty integration is part of our strategy to make it easy to monitor all the business dependencies that companies have today. SaaS now dominates, and companies prefer to buy rather than build. But with that comes the problem of monitoring all these dependencies, which are critical to daily operations.

GigaOm Radar Report

Aug 5, 2022 By Richard Whitehead In Moogsoft

In June, the research firm GigaOm published the 2022 edition of its annual Radar for AIOps Solutions. Having had time to digest the contents, it seems a good time to summarize the key takeaways from the Moogsoft perspective. First, in case you are not familiar with GigaOm, here’s a brief introduction.

MTTJ - What is Mean Time to Join (MTTJ)?

Aug 5, 2022 By AlertOps In AlertOps

MTTJ: The time taken to join a meeting, and the delays caused in ensuring the right people are available, can be avoided using software automation and tools. This is not an often-discussed topic, but we are sure everyone is directly affected by it. We discuss it in detail here: what it is, why it matters, and how it can be avoided.

Driving a customer-focused incident response process

Aug 4, 2022 By Martha Lambert In Incident.io

Deep into an incident, Slack firing, up to your ears in decisions, not sure where to turn next? It’s easy for external communication with your customers to fall far down the list of priorities in these moments. However, these are the exact situations where comms are vital, and where underestimating their importance can have damaging and lasting effects on your organisation.

Episode 6: Mooving to... Real release strategies with Jake Laverty

Aug 3, 2022 By Richard Whitehead In Moogsoft

Every product or application needs a release strategy. It’s how you can double-check that everything in your deployment is appropriately tested, validated and verified. Having a standardized release strategy in place allows your team to follow a protocol and reduce the number of unknowns they must face in the product life cycle. However, there are a few considerations to keep in mind to make this critical process run smoothly.

New! Common Automated Diagnostics for AWS Users

Aug 3, 2022 By Jake Cohen In PagerDuty

Modern cloud architectures centered on AWS are typically a composite of ~250 AWS services and workflows implemented by over 25,000 SaaS services, in-house developed services, and legacy systems. When incidents fire off in these environments, whether or not a company has built out a centralized cloud platform, distinct expertise is often a necessity.

The Do's and Don'ts of Blameless Incident Postmortems

Aug 3, 2022 By xMatters In xMatters

When an incident inevitably occurs, many organizations have a well-prepared incident management team that springs into action. Whether it’s a power outage or security breach, an incident can damage your company’s operations if not handled properly. A strong incident response team is critical to mitigating any negative impacts successfully. Furthermore, once your team resolves the problem, you should initiate a postmortem to detail the incident and record any lessons learned.

RESOLVE '22: Incident management automation

Aug 3, 2022 By Ryan Taylor In BigPanda

“Make life easier” isn’t a mantra for the lazy—it’s a way to drill down on important automation in the IT Ops room. When Ryan Taylor, VP of solutions engineering at Transposit, talks about his experience and outlook in the IT Ops chair, people tend to listen.

Automate incident response workflows with Eventarc and Datadog

Aug 2, 2022 By Thomas Sobolik In Datadog

Eventarc is a Google Cloud offering that ingests and routes events between GCP products, such as Cloud Run, Cloud Functions, and Pub/Sub, making it easy to build automated, event-driven workflows in complex environments. By taking care of event ingestion, delivery, authorization, and error handling, Eventarc reduces the development overhead that is required to build and maintain these workflows and helps you improve application resilience.

Tell the story of your incident with timeline curation

Aug 2, 2022 By Martha Lambert In Incident.io

It isn’t the first time you’ve heard us say this and it won’t be the last: getting your post-incident process right is a game-changer. Being able to run effective debriefs and create useful postmortems helps us learn from our mistakes, respond better to future incidents and identify how we can build resilience in our product and teams. In short, it’s the thing that shifts the dial from just “fixing” to actually improving.
