Latest Posts

How to Improve On-Call with Better Practices and Tools

Jul 30, 2020 By Emily Arnott In Blameless

In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage for incident response have become requirements for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task.

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jul 29, 2020 By Blameless Community In Blameless

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.

How to Choose Monitoring Tools for DevOps and SRE

Jul 23, 2020 By Emily Arnott In Blameless

When developing for reliability or implementing resilient DevOps practices, the heart of your decision-making is data. Without carefully monitoring key metrics like uptime, network load, and resource usage, you won’t know where to spend development effort or how to refine operational practices. Fortunately, a wide variety of monitoring tools are available to help you collect and get visibility into this data.

Leaders, Here's how to Encourage Full Service Ownership

Jul 22, 2020 By Hannah Culver In Blameless

Service ownership is becoming common practice, and its benefits are well known: happier customers, aligned teams, and fewer incidents. While this sounds great, it’s often easier said than done, requiring a shift in culture and mindset. Leadership will need to encourage and empower teams to adopt the “you build it, you run it” mentality. Here are some ways leaders can help get teams on board.

SREview Issue #3 July 2020

Jul 21, 2020 By Blameless Community In Blameless

Here’s the July issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.

How SLOs Help Your Team with Service Ownership

Jul 21, 2020 By Hannah Culver In Blameless

Service ownership is becoming a best practice for teams looking to innovate while maintaining the level of reliability that customers expect. Service ownership means seeing the service through its entire lifecycle. In short, it means you build it, you run it. You’ll be responsible for the service’s security, reliability, performance, and quality. This doesn’t mean you won’t have help from SREs to optimize or automate toil.

The Essential List of Top SRE Resources

Jul 17, 2020 By Emily Arnott In Blameless

Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!

5 Tips for Getting Alert Fatigue Under Control

Jul 16, 2020 By Hannah Culver In Blameless

What happens when you receive a notification that something is wrong with your system and you have no clue what it means, or why you’re receiving that alert? Maybe you have to parse through the alert conditions to suss out what the alert indicates, or maybe you need to ping a coworker and ask. Not knowing what to do with an alert also contributes to alert fatigue, because it increases the toil and time required to respond.

Leadership and Innovation with Instacart's VP of Infrastructure

Jul 15, 2020 By Blameless Community In Blameless

Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies. He is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

Promoting Continuous Learning with SRE

Jul 14, 2020 By Hannah Culver In Blameless

With the extreme changes we’ve all been through these last several months, it should come as no surprise that our jobs have changed drastically, too. We’re working remotely. We’re dealing with increased resource constraints. Our services are receiving more traffic than usual, and we’re tasked with keeping things up and running. Our work-as-done may not match what we did at the beginning of 2020.
