Incident Management

The latest news and information on Incident Management, On-Call, Incident Response and related technologies.

Q&A with Alex Hidalgo on SLOs

Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.

Modernizing and Consolidating Your Monitoring Without Losing It...

The current days of remote work and “IT Ops from home” may or may not be here to stay, but they definitely reinforce the need to consolidate and modernize our monitoring. Multiple siloed tools already make it hard to understand the big picture, and the problem only gets worse when we have just one screen to look at while monitoring our IT from the kitchen table.

Coronavirus: From the Office to Working From Home

Coronavirus (COVID-19) is greatly affecting organizations, employees and stakeholders. As the outbreak’s impact grows, more employees are moving to remote, work-from-home practices as a means of achieving “social distancing.” However, remote workdays bring inevitable challenges. Obstacles include, but aren’t limited to, employee isolation, diminished productivity and poor team communication or collaboration.

Best Practices for Pragmatic Incident Command

The goal of this piece is to provide some practical advice on how teams can coordinate and respond to complex, dynamic incidents. After all, incidents are unplanned investments that surface valuable learnings for improvement. For the purposes of this blog, we define incidents as situations where there is a need for coordination among multiple people working on the same problem. There will be incidents where this is not the case.

IT Operations in the Age of Coronavirus

Coronavirus has been a shock to the system for many IT organizations that are traditionally accustomed to working together in person. When you’re in an office, you can often use informal methods of communication—like swinging by someone’s desk, calling them on their office extension, or even imparting critical information when you run into them in the company cafeteria.

Keep the lights on - how Derdack facilitates remote work, communications and anywhere operations

We’ve already heard from a couple of customers that continuous operation of critical IT is a ‘must’ in these turbulent times, not only to facilitate remote work and home office for an unprecedented number of workers but also to keep critical IT infrastructure going. Our customers attest that Enterprise Alert is critical for them in today’s challenging times and that it helps them stay afloat and be fully operational from their home offices.

Leverage JIRA with Squadcast throughout the incident lifecycle

Atlassian’s Jira is issue and project tracking software that helps teams plan, track and manage projects. Customers and internal teams also use Jira to log issue tickets for the product and engineering teams to look into and resolve. This forms a feedback loop between the customer-facing and product teams to help drive and deliver the best possible software. Jira is widely adopted by Agile development teams, who use it to customize workflows and resolve issues collaboratively so they can ship good software fast.

COVID-19 Pandemic: How to Use AlertOps to Keep Your Enterprise Running

The coronavirus (COVID-19) pandemic has forced many global enterprises to temporarily shut down their operations, resulting in lost productivity and revenue. Yet, with a business continuity management (BCM) strategy, enterprises are well equipped to limit business interruptions until the pandemic passes.