Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

SRE Availability Metrics

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
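As a rough illustration of the uptime-to-availability question above, the sketch below computes availability as uptime divided by total observed time, and converts an availability target into a downtime budget. The function names and the 365-day year are assumptions made for this example; they are not taken from the referenced chart.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability as the share of total observed time the service was up."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_seconds / total


def downtime_budget_seconds(target_percent: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Maximum downtime allowed in the period for a given availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return period_seconds * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show the conversion.
    for target in (99.0, 99.9, 99.99):
        hours = downtime_budget_seconds(target) / 3600
        print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99.9% target works out to roughly 8.76 hours of allowed downtime per year.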

Understanding a Microsoft Service Outage

Maintaining business continuity when an issue arises is a challenge for many organizations. The global pandemic that began in Q1 of 2020, which many businesses are still navigating, introduced a new set of problems for both service providers and the businesses that rely on their services.

Enhance NOC Alerts With Incident Management and Alert Automation

In a network operations center (NOC), alerts originating from hundreds of servers, application monitoring systems, emails, and ticketing services compete for a NOC analyst’s attention. NOCs face many challenges in parsing alerts to identify actionable notifications and mobilize the right response team.

SRE Leaders Panel: Business Agility is what matters, SRE can help you get there

Blameless recently had the privilege of hosting SRE leaders Garima Bajpai, Founder at Community of Practice - DevOps Canada, and Jason Fraser, Delivery Lead at VMware Tanzu, to discuss the value of crisis during incident response, the best and worst tech transformations they’ve seen, how reliability impacts the flow of value, and more.

Concrete Steps to Reducing MTTR

In today’s data-centric world, performance benchmarks are defined by metrics. The time between when an event starts and when it ends shows how well a system can handle and process such events. One such metric is MTTR. MTTR usually stands for Mean Time To Resolution, but it has held several meanings over the years. It measures how well a system can bounce back from errors and provide long-lasting solutions.
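A minimal sketch of the calculation, assuming MTTR is read as Mean Time To Resolution and each incident is recorded as a start time and a resolution time. The function name and the sample incidents are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_resolution(
    incidents: Iterable[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - started) across all incidents."""
    durations = []
    for started, resolved in incidents:
        if resolved < started:
            raise ValueError("incident resolved before it started")
        durations.append(resolved - started)
    if not durations:
        raise ValueError("need at least one incident")
    # sum() needs a timedelta start value; timedelta / int yields a timedelta.
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Made-up incident records: (started, resolved).
    sample = [
        (datetime(2021, 4, 1, 9, 0), datetime(2021, 4, 1, 9, 30)),
        (datetime(2021, 4, 8, 14, 0), datetime(2021, 4, 8, 15, 30)),
    ]
    print(mean_time_to_resolution(sample))
```

A mean hides outliers, so a single very long incident can dominate the figure; tracking a median or percentile alongside it is a common complement.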

Monthly Moo Update | April 2021

I don’t know about you, but April traveled at the speed of light. A blink and it happened. Our teams have been working at the same speed throughout one of our favorite months of the year. With an incredible number of updates, we’ve made our product even more transparent and easier to use. It’s not just our world-class documentation that enables you; it’s also the in-product visualizations and enablement that help guide you without you even realizing it.

Top SRE Toolchain Used By Site Reliability Engineers

We have compiled a list of the most popular and sought-after tools (some you may have heard of) that SREs need in their toolkit at every phase of a production system. Site reliability engineering (SRE) practices help organizations keep their deliverables running smoothly, with the utmost reliability and resilience. This is achieved with a set of well-defined tools deployed at every phase of the production system to keep up with SRE best practices.