Latest News

The Lifecycle of a Service

Services are the backbone of our systems. Whether they’re functional microservices or logical components of a traditional application, they are the pieces that make up our businesses. We can’t do the computer thing without services. But who’s responsible for owning a service in your company or organization? The cast of characters involved in the lifecycle of a service is more than just software engineers.

"Homegrown" May Be Good for Tomatoes, Not So Much for IT Ops

In the past, many organizations grew and managed their own data centers. Some still do. And many are still developing their own automated incident management (aka Autonomous Operations) tools. But as IT grows ever more complex and fast-moving, the reality of what it takes to build and maintain such tools in-house kicks in, and organizations are re-evaluating their strategies.

Evolving Blameless' SRE Practices with Amy Tobey

At Blameless, we drink our own champagne, and aim to adopt a mindset of continuous learning to foster resilience. We believe that the adoption of SRE practices is one of the best ways to get there. Like most organizations, our early efforts to implement SRE were imperfect. However, through hard work, teamwork, and investing in what we believe is the most important feature (reliability), we have made significant changes to how we do SRE. And we’re getting better at it every day.

Scheduling IT and Engineering on-call rotations just got easier

It shouldn’t take more than a few seconds to understand your on-call schedule and rotations, or to see how to change them. On-call scheduling and alerting tools should make this as simple as possible. If you’re spending more than a few seconds working out what your on-call rotations will look like for the next day, week, or month, then you need to start looking for a better on-call management tool.

Importance of After-Hour Response Teams

Exceptional customer service is key in the world of IT, where something could go wrong at any given moment. This level of support translates into business retention, client satisfaction, and higher success rates and profits. In this post, I’ll introduce a hypothetical scenario in which “MSP Team A” provides 24×7 after-hours support to a valuable client.

Structuring Your Teams for Software Reliability

How well positioned is your team to ship reliable software? What are the different roles in engineering that impact reliability, and how do you optimize the ratio of software engineers to SREs to DevOps within teams? These questions can be hard to answer in a quantifiable way, but projecting different scenarios using systems thinking can help. Will Larson’s blog post Modeling Reliability does just that, and serves as inspiration for this article.

Got Game? Secrets of Great Incident Management

When his phone wakes him at two in the morning, operations engineer Andy Pearson knows it’s bad news. There’s a major server problem, and hundreds of client websites are down. Automated monitoring checks detected the outage within seconds and paged the on-call engineer. This time, it’s Pearson in the hot seat. Pearson quickly confirms the issue is real and escalates it to his boss, tech lead Lewis Carey.

Incident Response - how great companies do it

An incident response plan is a predefined course of action that tells IT teams how to respond efficiently to critical IT events. As modern applications continue to grow in scale and complexity, more people will be working on more interdependent systems. Consequently, the question is not if a system will fail, but when, and how best to respond.