Zenduty

Incident Alert Routing - Getting woken up only by alerts that matter to you

Jan 1, 2020 By Vishwa Krishnakumar In Zenduty

Site reliability engineers have one of, if not the, toughest roles in any organization. Dealing with incidents is one part of the job; the other is building reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE in balancing firefighting against toil reduction is alert noise.

Making on-call superheroes

Dec 27, 2019 By Amrit Balraj In Zenduty

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service, i.e., your reputation and source of revenue. Robust on-call schedules ensure that the right people are ready to go during times of crisis. Yet organizations continue to depend on on-call schedules and incident response processes that are a source of stress, anxiety, or panic for employees.

Zenduty - Anatomy of an Incident

Dec 16, 2019 By Zenduty In Zenduty

Watch the Zenduty Incident Command System in action!

Incident Response 2.0 - The Zenduty Incident Command System (ICS)

Dec 15, 2019 By Vishwa Krishnakumar In Zenduty

We are super excited to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). It was many months in the making and went through multiple iterations, and we believe it will redefine proactive incident management and response.

Incident Alert Routing - reducing noise and getting woken up only by alerts that matter

Dec 10, 2019 By Vishwa Krishnakumar In Zenduty

Site reliability engineers have one of, if not the, toughest roles in any organization. Dealing with incidents is one part of the job; the other is building reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE in balancing firefighting against toil reduction is alert noise.

Zenduty - Slack Incident Command System

Dec 9, 2019 By Zenduty In Zenduty

On-call doesn't have to be stressful

Nov 29, 2019 By Amrit Balraj In Zenduty

Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in how on-call rotations and responsibilities are organized that, if not avoided, can lead to serious consequences for the services and the teams.

The importance of GameDays

Nov 18, 2019 By Amrit Balraj In Zenduty

The term GameDay was coined by Amazon’s “Master of Disaster” Jesse Robbins, who created GameDays to increase reliability by purposefully creating major failures on pre-planned dates. GameDays help put the values of chaos engineering into practice. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated, continuously changing systems, outages are inevitable.

Site Reliability Engineering - Why you should adopt SRE

Nov 11, 2019 By Amrit Balraj In Zenduty

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure, and functional. He and his team eventually wrote the book on SRE, which is available online for free to anyone interested in researching and implementing SRE best practices.

Relationships between Operations and Development Teams

Oct 16, 2019 By Amrit Balraj In Zenduty

Modern businesses are evolving rapidly with the advent of cloud, CI/CD, and microservices. However, there is still an extensive and obvious divide between principal business stakeholders and development teams. Development teams are often unaware of the challenges faced by operations teams, and vice versa. This is where the need to adopt DevOps principles comes in. DevOps came into existence as the natural successor to Agile practices in software development.
