Defining and measuring your SLIs and SLOs

Customers expect online services to be available all the time. In truth, outages happen to almost everyone, because providing 100% service availability is challenging and costly. Creating a reliable and profitable service means, among other things, finding the balance between application availability, cost, and time to market. Faster feature delivery tends to reduce availability, because constant changes to production can cause issues and introduce bugs.
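To make the availability trade-off concrete, here is a minimal sketch of how an availability SLI and the error budget implied by an SLO might be computed. The function names, the 30-day window, and the request-based definition of availability are assumptions chosen for illustration, not something taken from the linked article.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Request-based availability SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the window as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an SLO tolerates over the window (assumed 30 days)."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if slo_target >= 1.0:
        # A 100% target leaves no budget at all, so any failure overspends it.
        raise ValueError("an SLO of 100% has no error budget")
    return (1.0 - sli) / (1.0 - slo_target)


if __name__ == "__main__":
    slo = 0.999  # hypothetical "three nines" target
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Budget for the window: {error_budget_minutes(slo):.1f} minutes")
    print(f"Budget consumed: {budget_consumed(sli, slo):.0%}")
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime. The remaining budget is what a team can spend on risky releases, which is one way to quantify the balance between delivery speed and availability.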

Inside the migration from Consul to memberlist at Grafana Labs

At Grafana Labs we run a lot of distributed databases. These databases all use a hash ring to distribute workloads evenly across the replicas of certain components. For a more detailed description of the architecture of our projects, check out our Mimir architecture docs.
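For readers unfamiliar with the idea, the following is a minimal, generic sketch of a consistent hash ring. It is not Grafana's implementation: the `HashRing` class, the token count, and the use of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring: each replica owns several tokens on a 32-bit ring."""

    def __init__(self, replicas, tokens_per_replica=64):
        # Each (token, replica) pair marks a point on the ring owned by that replica.
        self._ring = sorted(
            (self._hash(f"{replica}-{i}"), replica)
            for replica in replicas
            for i in range(tokens_per_replica)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Take the first 4 bytes of SHA-256 as a 32-bit position on the ring.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first token after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._tokens, self._hash(key)) % len(self._tokens)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["replica-a", "replica-b", "replica-c"])
    for series in ["cpu_usage", "http_requests_total", "disk_io"]:
        print(series, "->", ring.lookup(series))
```

Giving each replica many tokens smooths out the distribution of keys. When a replica leaves the ring, only the keys it owned move to other replicas, while every other key keeps its owner.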

Create and Manage Maintenance Windows Through PagerDuty Mobile App

To respond in real time to urgent, critical digital incidents, on-call responders must be able to take action from anywhere. But when on-call responders become overwhelmed with alerts, they often just “ignore them” because they cannot tell the difference between a real alert and a false one.

Code-level Application Monitoring for Every Developer

The monitoring, tooling, and observability space is crowded. It’s hard to keep track of what most tools in this category originally set out to do, but if we had to guess, they were probably built to support monolithic architectures with complex systems and to give Ops and IT a way to minimize the impact of an outage.

How I monitor cloud application costs in one simple but powerful dashboard

Although there are many great tools out there to get on top of application monitoring, there’s one vital metric that’s often overlooked by us technical folks – cost. In the days of running apps on servers in private datacenters, the kit was a one-time purchase that the systems team had to deal with. But running apps in public clouds is a different story. Whether you’re running on VMs, containers in Kubernetes, or entirely serverless, execution of your code adds to the bill.

DevOps vs. SRE: What's the Difference?

Despite there being significant differences in the roles, DevOps and Site Reliability Engineering are often lumped together because many people assume they do similar work. Although both attempt to reduce the issues arising from software development processes, their goals, skill sets, and approaches are actually quite different. DevOps engineers focus on the development pipeline, and their goal is to enable better development processes and workflows.

Top 20 CI/CD Pipeline Interview Questions & Answers

The key to acing a CI/CD interview is preparation. The first step is to learn as much as you can about the prospective company, including its background, offerings, and hiring practices. The next is to refresh your technical knowledge, which will help you stand out. To help you master your next interview and land your dream job, this blog post includes CI/CD interview questions, all neatly organized into themes.

Network Log Archiving = Perfect Backwards Visibility

Network monitoring is ideal for getting a real-time view of your connected environment, and with reports, you can look back in time too. Logs are key to this rear-view mirror look, as they contain all the data for all the elements you are monitoring. But without network log archiving, you can only look back so far. Did you know that according to an IBM/Ponemon study, it takes an average of 287 days to discover and contain a data breach?

Sponsored Post

What Are Runbooks and How Do They Apply to Network Operation Centers (NOCs)?

Much like other production environments, cloud services are built on and orchestrated by a plethora of tools that make up part of the overall cloud infrastructure. Because cloud services are complex and intricate, a vast range of detailed steps needs to be performed in a certain order for the production environment to run smoothly, whether that means carrying out maintenance procedures, updates and upgrades, or resolving issues to prevent downtime.