Latest News

SREview Issue #1 May 2020

May 29, 2020 By Blameless Community In Blameless

Welcome to the SREview! This zine will feature epic Tweets, content, and events happening in the SRE and resilience engineering community throughout the month.

Kubernetes Operators for Automated SRE

May 27, 2020 By Squadcast In Squadcast

It can be quite challenging for an SRE team to maintain the well-being of a large-scale Kubernetes-based system with hundreds or thousands of services. In this blog post, Gigi Sayfan, author of “Mastering Kubernetes”, outlines the SRE challenge and how we can achieve the ultimate goal of automated SRE with Kubernetes operators.

Release Notes: Stakeholder Engagement, Uptime Monitoring API, Flexible Periods for Schedules, and more

May 27, 2020 By iLert In iLert

Nowadays, a working digital infrastructure is the lifeblood of almost any organization. The impact of a major IT incident can go far beyond the IT department, affecting a company’s revenue or incurring costs in other areas of the business through service disruption. Therefore, in addition to the IT department’s technical response to a major incident, business stakeholders need to be involved as well, so they can prepare the business response.

Using context to triage change-triggered incidents

May 27, 2020 By Vishwa Krishnakumar In Zenduty

One of the first things incident managers do when they get an alert page from Zenduty is check the “Context” tab of the incident. Incident context is critical for getting a first responder’s view of what happened and what could have caused it. Context tells you what happened before an incident. For 40–50% of all incidents, Zenduty’s incident context can tell you within 5–10 seconds what the likely cause is.

How to Add Incident Alert Management to Your DevOps Pipeline

May 27, 2020 By Ritika Bramhe In OnPage

DevOps pipelines enable teams to implement continuous software development processes, often by using automation and collaboration tooling. The overall goal is to quickly release software products, updates, and fixes. To ensure a DevOps pipeline works well, teams add management and monitoring tooling to the pipeline. This includes incident alert management, which supports the team’s efforts in monitoring the security of various software and environment components.

Introducing Blameless Service Level Objectives

May 26, 2020 By Blameless Community In Blameless

Over a year ago, Blameless launched the industry’s first end-to-end SRE platform to help software teams innovate without sacrificing reliability. As Service Level Objectives (SLOs) provide an anchor for reliability targets and corresponding decisions, they are the foundational step toward helping teams truly adopt SRE best practices. Today, we are very excited to announce our new SLO platform, giving teams a shared language on how to focus their engineering efforts.

Spring 2020 Launch: New Capabilities for a New Digital Era

May 26, 2020 By Ariel Russo In PagerDuty

The ongoing pandemic and resulting economic downturn have led to dramatically changing market conditions. As a consequence, technology teams have become increasingly concerned with the need to minimize their financial risk and reduce costs to mitigate the effects of abruptly pivoting to a fully remote working environment. For some, there has been a struggle to maintain business continuity, that is, to keep the physical components of the business running when everyone is working from home.

Helicopter Services Company Improves Incident Response by 90 Percent With OnPage BlastIT

May 26, 2020 By Christopher Gonzalez In OnPage

Efficient team communication requires the proper set of tools and processes, ensuring that the right people receive timely messages. This way, recipients are well informed of a critical issue and still have time to address the incident. Unfortunately, a large helicopter services company relied on time-wasting procedures to communicate with stakeholders, resulting in delayed incident response and resolution.

Business Continuity Planning and Effective Communication - by Laura Toplis

May 25, 2020 By Ronald In SIGNL4

With many companies relying on remote working during the COVID-19 pandemic, effective communication is more important than ever. Unfortunately, being in the middle of responding to a global pandemic will not prevent your organization from suffering other business disruptions. Likely disruptions you may face are: Cyber/phishing attacks – these attacks can cripple your regular communication methods such as email, or may exploit ineffective communications to extract illegal payments.

Real-time alerts from Zabbix and escalation with Zenduty

May 21, 2020 By Vishwa Krishnakumar In Zenduty

Recently, one of our customers, a 20-member NOC team at a large B2C company, set up Zabbix to monitor a network of more than 1,000 servers, routers, and switches. The NOC team wanted to set up alerting, on-call scheduling, and an escalation matrix for whenever a critical network component encountered downtime. The NOC team used Slack as the primary communication channel and Zoom for real-time communication. For NOC teams like these running a very large operation, setting up alerting can be very tricky.
