Understanding Metrics, Events, Logs and Traces - Key Pillars of Observability
Understanding Metrics, Logs, Events and Traces - the key pillars of observability and their pros and cons for SRE and DevOps teams.
The latest News and Information on Service Reliability Engineering and related technologies.
Understanding Metrics, Logs, Events and Traces - the key pillars of observability and their pros and cons for SRE and DevOps teams.
What's the difference between SREs and Platform Engineers? How do they differ in their daily tasks?
Streaming Aggregation and Recording Rules are two ways to tame High Cardinality. What are they? Why do we need them? How are they different?
Site Reliability Engineering is a process of automating IT infrastructure functions, including system management and application monitoring using software tools. It is used by businesses to guarantee that their software applications are reliable even when they receive frequent upgrades from development teams. SRE allows engineers or operations teams to automate the activities that are traditionally performed by operations teams manually to manage production systems and handle issues.
Everything you need to know about Prometheus Remote Write mechanism and storing metrics in long term storage such as Levitate.
Comparison between Prometheus and Datadog - two of the most popular monitoring tools in the market today.
DevOps teams and site reliability engineers (SREs) contend with a never-ending flood of notifications and alerts about outages, potential threats, and other incidents. Companies rely on their DevOps teams to not only keep abreast of all the notifications but also to identify and prioritize the critical alerts and resolve problems in a timely manner. Yet in 2021, International Data Corporation (IDC) reported that companies with 500-1,499 employees ignored or failed to investigate 27% of all alerts.
High Cardinality woes are far & frequent in today's modern cloud-native environment. What does it mean, & why is it such a pressing problem?