
Monitoring

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Observing container environments with Cloud Operations

Did you know GKE isn’t the only place you can run containers in Google Cloud? In this episode of Engineering for Reliability, we show three options for running containers, as well as how to instrument each one for observability with Cloud Operations. Watch to learn how Cloud Operations can help you visualize metrics and analyze logs emitted by container workloads running on GKE, on Cloud Run, and on an Anthos cluster!

Facebook Outage: The Case for Configuration & Change Management

In the age of cloud, digital transformation, application modernization, and the mobile economy, the network is the lifeblood behind enabling excellent customer experiences. Network Operations (NetOps) and IT Operations (ITOps) teams are constantly aware that a disruption in core network systems performance can have a massive impact on their business.

Is service catalog the modern CMDB?

SquaredUp recently launched a PowerShell tile that lets you visualize data returned from a PowerShell script. This has opened virtually infinite doors to the sources you can get data from. PowerShell can work with crazy text formats, obscure databases, and endpoints that are open on the internet. If you can access it, PowerShell can work with it. And SquaredUp lets you leverage that power so you can get the information you need and visualize it in a format that makes sense.

What is Digital Experience Monitoring?

Businesses globally have been steadily shifting to digital since as early as a decade ago. With the coronavirus pandemic, digital transformation has shifted into fifth gear. Digital experience is the key to business success. As of 2020, there were almost 30 billion end users connected to the internet. Digital revenue has increased dramatically, and digital will surely drive retail sales up.

10 SQL Server Performance Tuning Best Practices

There are a large number of best practices around SQL Server performance tuning – I could easily write a whole book on the topic, especially when you consider the number of different database settings, SQL Server settings, coding practices, SQL wait types, and so on that can affect performance.

A snapshot of my daily work

Today I show you a snapshot of my daily work. It is especially interesting this time, because it’s a not-so-simple problem to solve. It’s not difficult per se, but it requires quite some understanding of the Icinga Web 2 framework and how it communicates with the web server. Disclaimer: What I’m going to show is not a feature preview or anything. It’s more of a proof of concept, and it may stay that way forever and never be continued.

Honeycomb Differentiators Series: SLOs That Tell the Whole Story

In the recent past, most engineering teams had a vague notion of what Service Level Agreements (SLAs) and Service Level Objectives (SLOs) were: mainly things that their more business-focused colleagues talked about at length during contract negotiations. The success or failure of SLAs was tallied via magic calculations (what is “available” anyway?!) at the end of the month or quarter, and adjustments were made in the form of credits or celebrations in the break room.

Mastering AWS identity and access management

From the basic to the advanced concepts of AWS’s own service for identity and access management: users, groups, permissions for resources and much more. For seriously working with AWS, there’s no way around its Identity and Access Management (IAM) service. Skipping its core principles will bite you again and again in the future. Take the time to do a deep dive, so you won’t be frustrated later.

What is a Site Reliability Engineer (SRE)?

A site reliability engineer, or SRE, is a role that encompasses aspects of both software engineering and operations/infrastructure. It also encompasses a strategy and set of practices and principles across service offerings and is closely tied to DevOps and operations. The term site reliability engineering first came into existence at Google in 2003 when a site reliability team was created. At that time, the team was made up of software engineers.