Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Incident management with Microsoft Teams and Zenduty

Teams is Microsoft’s versatile chat and collaboration solution for enterprise communication. Teams comes bundled with Office 365, offering chat, file sharing, and a host of other collaborative features. The platform also integrates with popular project management applications, chatbots, and alert management platforms, which makes it a hot favorite of production teams.

Application performance monitoring with Datadog 2020

Datadog is an application performance monitoring and analytics SaaS for cloud infrastructure. Datadog enables DevOps teams, SREs, and IT operations teams to optimize their systems for uptime and availability. Modern services generate massive amounts of data from many different services and technologies. Datadog supports more than 400 integrations and collects data for improving visibility across dynamic production environments.

Incident Response - how great companies do it

An incident response plan is a pre-devised plan of action for IT teams on how to respond to critical IT events efficiently. As modern applications continue to grow in scale and complexity, more people will be working on more interdependent systems. Consequently, the question is not if a system will fail, but when, and how best to respond.

Monitoring with New Relic - Everything you need to get started

DevOps is an organizational philosophy that enables continuous delivery and continuous deployment with a focus on continuous testing, automation and collaboration among dev teams, business, and operations teams. Consequently, continuous monitoring is also a key phase of the DevOps lifecycle, which is where application performance monitoring tools come into the picture. APM tools enable developers to monitor user experience in real-time with an eye on the health and stability of their applications.

Grafana- Everything you need to know

Grafana is an open-source platform for data visualization, monitoring, and analysis. It is designed around providing context-rich visualizations, mainly through graphs, but it also supports other ways to present data through a pluggable panel architecture. Every dashboard is versatile and can be custom-built for specific software development projects or business requirements. Grafana’s beautiful dashboards are one of the reasons it is so popular with users.

Site reliability engineering- Predictions for 2020

As we head into 2020, it's clear that DevOps has finally crossed the divide and gone mainstream. With DevOps firmly ingrained as a standard practice, we now look at how it will evolve. DevOps is driving more overall alignment between development and operations teams than has ever existed in the past. For developers, that means building and delivering impeccable apps to market quickly.

Incident Alert Routing - Getting woken up only by alerts that matter to you

Site reliability engineers have one of, if not the, toughest roles in any organization. While dealing with incidents is one part of the job, the other is to build reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE when it comes to balancing work between firefighting and toil reduction is the issue of alert noise.

Making on-call superheroes

Building a world-class service is as much about maintaining software as it is about developing it. On-call engineers are typically responsible for ensuring the reliability and availability of your service, i.e., your reputation and source of revenue. Robust on-call schedules ensure that the right people are ready to go during times of crisis. Yet organizations continue to depend on on-call schedules and incident response processes that are a source of stress, anxiety, or panic for employees.

Incident Response 2.0 - The Zenduty Incident Command System (ICS)

We are super excited today to introduce our latest Zenduty integration with Slack, which we are calling the Zenduty Slack Incident Command System (Slack-ICS). It was many months in the making and went through multiple iterations, and we believe it will redefine proactive incident management and response.

Incident Alert Routing - reducing noise and getting woken up only by alerts that matter

Site reliability engineers have one of, if not the, toughest roles in any organization. While dealing with incidents is one part of the job, the other is to build reliable systems. Google’s SRE book sums up this approach nicely. One of the most important challenges for an SRE when it comes to balancing work between firefighting and toil reduction is the issue of alert noise.