The red team: ServiceNow's first line of defense

If you ask any ServiceNow employee about their role, they'll likely tell you their job and team are the best they’ve ever had. One small but mighty team proclaims this proudly: the red team, a group of professional hackers. As vigilant guardians of the company, the six-person team is tasked with testing the security of our systems and identifying cyber risks, data vulnerabilities, and security threats.

Common Nagios Errors and What to Do about Them

Nagios is an open-source monitoring system that has become indispensable for system administrators and DevOps teams across the world. However, as with any software, you're bound to run into errors when using it. In this article, we look at some common Nagios errors and how to solve them, along with the pros and cons of Nagios and why MetricFire is the perfect alternative for monitoring.

How To Profile and Optimize Telemetry Data: A Deep Dive

We recently had the privilege of presenting our telemetry data pipelining platform at Cloud Field Day. Today, we'd like to share a recap of our demo with you. In this demo, we explore the transformative potential of data profiling, telemetry pipeline optimization, and incident response. Foundationally, we follow an Understand, Optimize, and Respond workflow.

Choosing Azure Database Services - What are the options?

Microsoft Azure offers a choice of relational and non-relational database services to support a wide range of application needs and demands. Built-in intelligence helps automate management tasks such as high availability, scaling, and query performance tuning, so that applications stay available and performant. Many services offer essentially limitless database scale, and SLAs (Service Level Agreements) usually range from 99.9% to 99.999% availability.

How does SIGNL4 provide for truly reliable alerting?

Of course, one expects an alerting solution to be reliable. This matters because a missed alert can have a significant impact on the business, whether it concerns IT uptime, disruptions in production, or other critical system conditions. Business processes, production workflows, and therefore money, the company's reputation, or even the health of employees are at stake. But what does reliable alerting actually mean, and how is it achieved?

How Flexcity used Grafana Cloud to help balance the national power grid in France

Last winter, Flexcity, a market leader in electric flexibility, faced an unprecedented challenge: Help stabilize the French national power grid in the midst of a widespread energy crisis that loomed over Europe. As a byproduct of the Russian invasion of Ukraine, energy prices in the EU soared in 2022. Meanwhile, France faced a nuclear power outage that winter that threatened to significantly disrupt its energy supply and increase the risk of electricity shortages.

Understanding Request Latency with Profiling

It can be hard to figure out why response times are high in Java applications. In my experience, when engineers investigate this type of issue, they typically use one of two methods: They either apply a process of elimination to find a recent commit that might have caused the problem, or they use profiles of the system to look for the cause of changes in relevant metrics.