Alerting

Efficient task management for remote/work-from-home teams

As COVID-19 continues to impact communities globally with health care professionals working tirelessly to prepare for emergencies and prevent the further spread of the pandemic, technology companies are also doing their part. Twitter, Google, and Amazon have issued directives instructing employees to work from home as the companies themselves move to pull out of tech events while hosting their own events virtually.

Remote Work: Splunk + Zoom

As everyone takes proactive measures to stay healthy, organizations are increasingly having their employees work from home. At Splunk, we are focused on bringing data to every question, decision and action, and for us remote work means Zoom for online meetings and workspaces. Our customers use Splunk for real-time data processing and analytics, and they use the Splunk Mobile App (Android, iOS) when they need to take their dashboards on the go.

6 Steps to a More Effective Postmortem

Detailed and specific description of impact? Check. In-depth root cause analysis? Check. Clearly defined and easy to follow resolution? Check. Postmortems present an incredible learning opportunity, despite the inherent cost of time and effort. They ensure an incident is documented, that all contributing factors are understood, and that effective preventative actions have been put in place to reduce the likelihood or impact of recurrence.

Incident management for remote/WFH teams

As the world tries to battle COVID-19, most of our customers here at Zenduty have started implementing social distancing measures within their companies by asking all their employees, including the NOC, SRE, ITOps, Support, and software engineering teams to work remotely or from home. While that may appear to be a drastic change in your day-to-day operations, it need not disrupt your reliability and support operations.

PagerDuty Is for People: Supporting Our Community During COVID-19

Yesterday, we released our earnings during an unprecedented time for society and the market. One of the things I noticed was the collective empathy we experienced as we talked to different teams and companies in preparation, and in our analyst callbacks, where everyone, to a person, kicked off their call by wishing each other good health and safety. It reminded me that when we are all in this together, not only are great things possible, but challenges also feel less daunting and more manageable.

Our Top 5 On-Call Practices

On-call: you may see it as a necessary evil. When responding to incidents quickly can make or break your reputation, designating people across the team to be ready to react at all hours of the day is a necessity, but often creates immense stress while eating into personal lives. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not.

Custom Alerts Using Prometheus Queries

Prometheus is an open-source system for monitoring and alerting originally developed by SoundCloud. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and became one of its most popular projects after Kubernetes. It can monitor everything from an entire Linux server to a stand-alone web server, a database service or a single process. In Prometheus terminology, the things it monitors are called targets. Each measurement collected from a target is called a metric.
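
To make the target/metric distinction concrete, here is a minimal Python sketch that reads the kind of plain-text output a target might expose and splits it into individual metric samples. The sample text, the metric names, and the parse_samples helper are all hypothetical and chosen for illustration. The parser only handles simple lines (no timestamps, no commas or escaped quotes inside label values), so treat it as a teaching aid and not a replacement for a real client library.

```python
# Hypothetical sample of what a single target might expose when scraped.
SAMPLE_SCRAPE = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_open_fds 12
"""


def parse_samples(text):
    """Return (metric_name, labels, value) tuples from simple exposition-style text.

    Assumption: each sample line is `name{label="value",...} number` or
    `name number`, with no timestamp and no commas inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space.
        series, _, value = line.rpartition(" ")
        # The metric name is everything before the optional label block.
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, raw = pair.partition("=")
                labels[key] = raw.strip('"')
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE_SCRAPE):
        print(name, labels, value)
```

In this sketch the whole text block stands in for one target, and each parsed tuple is one metric sample from it; labels such as method and code distinguish samples that share a metric name.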