Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct, with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement, and the lines became blurred. So where do they stand today?
Incident management is a critical aspect of IT service management (ITSM) that revolves around restoring normal service operations as swiftly as possible after an unplanned interruption or reduction in quality. These interruptions, referred to as “incidents,” can range from a minor issue, such as a single user being unable to access a service, to a major disruption, such as a server crash or network outage affecting many users.
In this article, we will cover how to monitor Kubernetes using Graphite, with Grafana for visualization. The focus will be on collecting and plotting the essential metrics for Kubernetes clusters. We will download and set up custom Kubernetes dashboards from the Grafana dashboard resources and use them to monitor a cluster. These dashboards have variables that allow drilling down into the data at a granular level.
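To make the Graphite side of this concrete, here is a minimal sketch of how a single data point can be pushed to Graphite using Carbon's plaintext protocol, where each line has the form `<metric.path> <value> <unix_timestamp>`. The metric path `k8s.node1.cpu.usage`, the host `localhost`, and the helper names are assumptions made for illustration; 2003 is Carbon's default plaintext port, but your deployment may differ.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Build one line of Carbon's plaintext protocol: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=5.0):
    """Send pre-formatted lines to a Carbon receiver over TCP.

    host and port are assumptions; adjust them to your Graphite setup.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric path, used only for illustration.
    line = format_graphite_line("k8s.node1.cpu.usage", 0.42)
    print(line, end="")
    # Uncomment once a Carbon receiver is reachable:
    # send_metrics([line])
```

The dotted metric path is what later appears as a tree in Grafana's Graphite query editor, so a consistent naming scheme (cluster, node, resource) tends to make dashboard variables easier to define.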
Prometheus is becoming a popular tool for monitoring Python applications, even though it was originally designed for single-process, multi-threaded applications rather than the multi-process model common in Python deployments. Prometheus was developed at SoundCloud and was inspired by Google’s Borgmon. In its original environment, Borgmon relies on straightforward methods of service discovery, where Borg can easily find all jobs running on a cluster.
Before we jump into the specifics of Grafana and Datadog, let's look at the main comparison points. Grafana is a dashboarding tool that can plug into a wide range of data sources. It is most commonly paired with Prometheus, Graphite, and Elasticsearch to provide a full APM, time-series, and logs monitoring stack.
Alexander is a Senior SRE at Prezi, a video and visual communications software company. As a team, the Prezi SREs provide multiple services within the company. One of those is the observability stack, for which Prezi relies heavily on Grafana. Companies are always evolving to run more smoothly, serve their customers better, and operate in a way that is cost-effective.