Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Honeycomb + Squadcast Integration: Routing Incident Alerts Made Easy

Honeycomb is an application monitoring tool that helps DevOps and SRE teams to operate more efficiently by offering rich observability solutions and intuitive team collaboration. It helps understand complex relationships within your distributed systems and troubleshoot issues accordingly. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities.

SRE Metrics: Four Golden Signals of Monitoring

SRE (site reliability engineering) is a discipline used by software engineering and IT teams to proactively build and maintain more reliable services. SRE is a functional way to apply software development solutions to IT operations problems. From IT monitoring to software delivery to incident response – site reliability engineers are focused on building and monitoring anything in production that improves service resiliency without harming development speed.

DevOps vs SRE - Reducing Technical Debt and Increasing Efficiency and Resiliency

One more blog topic stemming from our weekly office hours that we hold with the field team here at Shipa. In our last office hours, was asked a question about “what are the difference between DevOps Engineers and SREs?”. Both professions are emerging disciplines and cultures that continue to evolve and play an importance in technology organizations. I’ve been fortunate to have written and spoken about this before; though taking a fresh look at what the two domains try to accomplish.

Salesforce Cloud + Squadcast Integration: Routing Detailed Incident Alerts

Salesforce Cloud is one of the leading cloud-based customer relationship management (CRM) solutions. It provides a shared view of your customers and their relationship with the business. With Salesforce Cloud, users can automate service processes and streamline workflows. Squadcast is an end-to-end incident response tool. Built with an SRE mindset, it streamlines all the incident response activities. Squadcast aligns your teams towards a common organizational goal of better reliability.

Severity Levels (What They Are & Why They Matter)

Wondering about severity levels? We explain what incident severity levels are, how to classify them, and how they will affect your incident management process. What are severity levels? Incident severity levels are the measure of the impact an incident will have on a system. In general, a lower number severity level, such as SEV-1, denotes a higher impact on the system.

How to Implement Global View and High Availability for Prometheus

Ensuring that systems run reliably is a critical function of a site reliability engineer. A big part of that is collecting metrics, creating alerts and graph data. It’s of the utmost importance to gather system metrics, from several locations and services, and correlate them to understand system functionality as well as to support troubleshooting.

What Does AIOps Mean for SREs? It's Complicated.

If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier. Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter. Which perspective is right?