Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

On-call doesn't have to be stressfull

“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.

The importance of GameDays

GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins when he created them intending to increase reliability by purposefully creating major failures on pre-planned dates. Game Days help facilitate the values of chaos engineering. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated continuously changing systems, outages are inevitable.

Site Reliability Engineering-Why you should adopt SRE

Site reliability engineering was a term coined by Google engineer Benjamin Treynor in 2003 when he was tasked with making sure that Google services were reliable, secure and functional. He and his team eventually wrote the book on SRE which is available online for free for anyone interested in research and implementation of SRE best practices.

Relationships between Operation and Devlopment Teams

Modern businesses are evolving rapidly with the advent of cloud, CI/CD and microservices. However, there still exists an extensive and obvious divide between principle business stakeholders and developmental teams. Development teams are often unaware of the challenges faced by operations teams and vice-versa. This is where a need for adoption of DevOps principles comes into the picture. DevOps which came into existence as the natural successor to Agile practices in software development.

ChatOps-The future of collaboration

ChatOps is the implementation of chatbots to unify communication and collaboration. Through ChatOps every single member of a team will be aware of what the other members are working on. It is the logical next step in the evolution of communication among teams after email and IM. Projects of today are developed at a global scale with millions of people as potential users, this means that teams are larger and often work in shifts or even remotely.

Post Mortems- Bringing clarity to incident reviews

An incident post mortem is known by many names- incident review, root cause analysis (RCA), learning review, but what do they entail?. A post mortem is a post-incident activity to help organizations understand how the incident happened and to learn from it. Service incidents are an unavoidable hurdle for any company when they do happen, the teams working will be wholly focussed on restoring service as quickly as possible.

The importance of Incident Roles

Modern technology organizations are required to be adaptive in their approach to incident management. A single project will have multiple teams working as different branches on integrated systems. Even if all the members have unified communication channels when an interruption occurs in the service there’s bound to be chaos. The frontline response team will have to be on their toes to get to the root issues at the first signs of trouble.

Fostering blamelessness at the workplace

An integral lesson every business (of any size) learns is that failure is inevitable at some point in the production cycle. There might be times where things go haywire at critical junctures sending teams scrambling to rectify the root issue and reinstate service. The underlying causes are often many and varied especially in large scale systems with complex architecture and interdependence.