The latest news and information on Site Reliability Engineering and related technologies.

Managing Squadcast resources with our expanded Terraform provider

Hey folks! We’re excited to announce that we’ve vastly expanded the capabilities of our Terraform provider. Previously, it was limited to creating and managing services as a resource. It now covers the entire spectrum of resources available on Squadcast, from creating and managing users and escalation policies to managing SLOs. What does that mean for you?

Using Observability with Kubernetes to Automate Site Reliability Engineering

In this video, Anthony Evans, solution architect, explains how the StackState topology-powered observability platform can help SREs automate site reliability, putting their organizations on the path to becoming a zero-downtime enterprise. See how StackState helps to unify and correlate data across your stack, visualize your entire IT environment, instantly pinpoint root cause, reduce alert storms and, with AIOps capabilities, even prevent problems proactively. It's all here!

What is a Security Operation Center and how do SOC teams work?

With the growing complexity of IT environments, it is essential to have robust security processes that can safeguard them from cyber threats. In this blog, we will explore how security operation centers (SOCs) help you monitor, identify and prevent cyber threats. This blog covers the following pointers.

What are the four Golden Signals?

When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering Handbook, first published in 2016, details their practices to maintain reliability as Google scaled. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way?

How to add a Golden Signal to a service in Gremlin RM

In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.

Round Robin Escalation: An Efficient Way to Distribute On-Call Responsibilities

Nowadays, organizations address a high volume of incidents every day. With so much happening, responders can be overwhelmed and may end up de-prioritizing important incidents. Hence, it is important to have an efficient on-call scheduling and escalation process in place. In this blog, we will explore how Round Robin Escalations can help distribute on-call load and set up efficient on-call schedules. This blog covers the following pointers.
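
As a rough illustration of the round-robin idea, here is a minimal Python sketch that hands each new incident to the next responder in a fixed rotation. The class name `RoundRobinEscalation`, its methods, and the responder names are hypothetical and chosen for this example; they are not Squadcast's API or implementation.

```python
from typing import Dict, List, Sequence


class RoundRobinEscalation:
    """Hypothetical helper: rotate incident assignment across responders."""

    def __init__(self, responders: Sequence[str]) -> None:
        if not responders:
            raise ValueError("at least one responder is required")
        self._responders: List[str] = list(responders)
        self._next_index = 0  # position of the responder who gets the next incident

    def next_responder(self) -> str:
        """Return the next responder and advance the rotation by one."""
        responder = self._responders[self._next_index]
        self._next_index = (self._next_index + 1) % len(self._responders)
        return responder

    def assign(self, incidents: Sequence[str]) -> Dict[str, str]:
        """Map each incident to a responder, in arrival order."""
        return {incident: self.next_responder() for incident in incidents}


# Example with made-up names: four incidents spread over three responders.
rotation = RoundRobinEscalation(["alice", "bob", "carol"])
assignments = rotation.assign(["INC-1", "INC-2", "INC-3", "INC-4"])
```

Because the rotation keeps its position between calls, the load stays even across successive batches of incidents rather than always starting again from the first responder.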

The SRE's Quick Guide to Kubectl Logs

Logs are key to monitoring the performance of your applications. Kubernetes offers a command-line tool called kubectl for interacting with the control plane of a Kubernetes cluster. This tool provides debugging, monitoring, and, most importantly, logging capabilities. There are many great tools for SREs. However, Kubernetes supports Site Reliability Engineering principles through its capacity to standardize the definition, architecture, and orchestration of containerized applications.
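
For a sense of what fetching logs with kubectl looks like in practice, the sketch below builds a `kubectl logs` command line in Python. The helper `kubectl_logs_command` and the pod and namespace names are hypothetical; the flags it emits (`--namespace`, `--container`, `--tail`, `--since`, `--follow`, `--previous`) are standard `kubectl logs` options.

```python
from typing import List, Optional


def kubectl_logs_command(
    pod: str,
    namespace: Optional[str] = None,
    container: Optional[str] = None,
    tail: Optional[int] = None,
    since: Optional[str] = None,
    follow: bool = False,
    previous: bool = False,
) -> List[str]:
    """Build an argument list for `kubectl logs` (hypothetical helper)."""
    cmd = ["kubectl", "logs", pod]
    if namespace:
        cmd += ["--namespace", namespace]
    if container:
        cmd += ["--container", container]  # needed for multi-container pods
    if tail is not None:
        cmd.append(f"--tail={tail}")  # only the last N lines
    if since:
        cmd.append(f"--since={since}")  # relative duration, e.g. "15m"
    if follow:
        cmd.append("--follow")  # stream new log lines as they arrive
    if previous:
        cmd.append("--previous")  # logs from the previous container instance
    return cmd


# Example with made-up names: last 100 lines from the past 15 minutes.
command = kubectl_logs_command("checkout-0", namespace="shop", tail=100, since="15m")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where kubectl is installed and configured for the target cluster.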

Healthchecks + Squadcast Integration: Routing Alerts Made Easy

Healthchecks is a cron job monitoring service that listens for HTTP requests and email messages ("pings") from your cron jobs and scheduled tasks ("checks"). You update your job to send an HTTP request to the ping URL every time the job runs. When your job does not ping Healthchecks.io on time, you receive an alert! If you use Healthchecks for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Healthchecks to the right users in Squadcast.
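
The ping pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_with_ping` and `http_get` are hypothetical helper names, the ping URL is whatever your check provides, and appending `/fail` to signal a failed run is assumed here to match your Healthchecks setup.

```python
import urllib.request
from typing import Callable


def http_get(url: str, timeout: float = 10.0) -> None:
    """Send a GET request, ignoring network errors so monitoring never breaks the job."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except OSError:  # URLError and socket timeouts are subclasses of OSError
        pass


def run_with_ping(
    job: Callable[[], None],
    ping_url: str,
    send: Callable[[str], None] = http_get,
) -> bool:
    """Run `job`, then ping `ping_url` on success or `ping_url + "/fail"` on error."""
    try:
        job()
    except Exception:
        send(ping_url + "/fail")  # assumption: the service treats this as a failure signal
        return False
    send(ping_url)  # success ping: the job ran to completion
    return True
```

Passing a different `send` callable makes the wrapper easy to exercise without network access. If the job never runs at all, no ping is sent, and the missing ping is what triggers the alert.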