Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Using Konvoy to Patch your Cluster Infrastructure (Part 1)

Recently we hit the infamous kmem bug in our internal production Konvoy cluster. We discovered the issue after users began reporting that a particular CI job was failing intermittently throughout the cluster, with errors showing up in both the pod logs and the kernel logs.

Network operating system: operation, history and monitoring

In the early days of telephony, few people in any town or city had a phone number. There were only a few numbers to remember, and if your memory failed, you could check with a telephone operator, a job held by women who always knew (and know) how to listen. Let’s see how a network operating system was born in the middle of the 20th century, right at the center of telephone networks.

Elastic Training helps UK Driver and Vehicle Licensing Agency better serve motorists

The core responsibility of the UK's Driver and Vehicle Licensing Agency (DVLA) is to maintain more than 48 million driver records, more than 40 million vehicle records, and to collect approximately £6 billion ($7.75 billion) a year in Vehicle Excise Duty. The agency is at the forefront of public digital services, and has made significant progress in transforming its IT systems into new cloud-based platforms.

Real-Time Cost Alerts and Forecasts for AWS

For many companies, cloud costs are among the top investments these days. With a growing number of services, instances and regions, cloud cost optimization is becoming increasingly painful. Companies use cloud management platforms to optimize costs and increase cloud visibility and security. But staying on top of AWS budgets requires proficiency, agility and time—especially when any glitch can result in massive cost bleeds.

Network Operations Center Best Practices (in 2020)

Your Network Operations Center (NOC) is responsible for network monitoring, incident response, and other network operations activities — and you want to optimize its performance. To achieve your goal, your NOC team assesses data and explores ways to improve its everyday operations. The team may also implement NOC best practices or craft some of its own. NOC teams manage network availability and performance, along with servers, databases, firewalls, devices, and related external services.

Top Five Reasons Why Companies Are Choosing OnPage Over Competitors

OnPage’s intelligent incident management system is the alerting solution of choice for industry-leading organizations. Since the beginning, companies have invested in the OnPage system for its advanced capabilities, out-of-the-box integrations and unmatched 24/7 customer support. Though we can provide a comprehensive view into OnPage’s competitive advantage, here are the top five reasons why customers continue to trust OnPage’s incident management system.

Insights for AWS Kinesis and Step Functions now supported by Dashbird

August 2020 marks 3 years of Dashbird empowering serverless DevOps teams to fully understand their complex serverless infrastructures through full observability and insight into their performance. This birthday month, we have plenty of surprises, giveaways, and goodies up our sleeve over the next few weeks, so sign up for our newsletter to be the first to know.

Where did all my spans go? A guide to diagnosing dropped spans in Jaeger

Nothing is more frustrating than feeling like you’ve finally found the perfect trace only to see that you’re missing critical spans. In fact, a common question for new users and operators of Jaeger, the popular distributed tracing system, is: “Where did all my spans go?” In this post we’ll discuss how to diagnose and correct lost spans in each element of the Jaeger ingestion pipeline.

How to maximize span ingestion while limiting writes per second to Scylla with Jaeger

Jaeger primarily supports two backends: Cassandra and Elasticsearch. Here at Grafana Labs we use Scylla, an open source Cassandra-compatible backend. In this post we’ll look at how we run Scylla at scale and share some techniques to reduce load while ingesting even more spans. We’ll also share some internal metrics about Jaeger load and Scylla backend performance. Special thanks to the Scylla team for spending some time with us to talk about performance and configuration!