Top 6 EC2 rightsizing recommendations that you can't ignore

Imagine a day at work when you realize that your team’s youngest developer forgot to terminate a compute instance: the bill spikes and the budget is breached. Rightsizing recommendations come to the rescue in such situations by identifying underutilized, overutilized, or mismanaged resources and suggesting corrective actions.
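As a rough illustration of how underutilized and overutilized instances might be flagged, here is a minimal Python sketch. The `InstanceUsage` type, the `recommend` function, and the 10% / 80% average-CPU thresholds are all hypothetical choices made for this example; they are not taken from the article or from any AWS tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class InstanceUsage:
    """CPU utilization samples (in percent) for one instance. Hypothetical type."""
    instance_id: str
    cpu_percent_samples: List[float]


def recommend(usage: InstanceUsage, low: float = 10.0, high: float = 80.0) -> str:
    """Return a coarse rightsizing hint based on average CPU utilization.

    The `low` and `high` thresholds are illustrative assumptions only.
    """
    if not usage.cpu_percent_samples:
        # Without metrics we cannot judge utilization at all.
        return "review: no metrics"
    average = mean(usage.cpu_percent_samples)
    if average < low:
        return "downsize or stop"
    if average > high:
        return "upsize"
    return "keep"
```

A real recommendation would likely also weigh memory, network, and peak usage over a longer window; average CPU alone is used here only to keep the sketch short.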

Better CloudWatch Metrics in Honeycomb with the OpenTelemetry Collector

CloudWatch metrics can be a very useful source of information for a number of AWS services that don’t produce telemetry as rich as that of instrumented code. They also cover non-web-request-based functions, such as concurrent database requests. We use them at Honeycomb to get statistics on load balancers and RDS instances. Amazon Data Firehose can also export directly to Honeycomb, which makes getting the data into Honeycomb straightforward.

A Guide to Retiring SDH & TDM Networks

The telecommunications industry is witnessing a significant transformation as operators realize the urgent need to retire their legacy Synchronous Digital Hierarchy (SDH) and Time-Division Multiplexing (TDM) networks. With advances in packet networks, this migration is not just a necessary transition but a strategic opportunity to enhance service offerings and operational efficiencies.

LightMesh + AWS: Secure Subnet Discovery and Unified IPAM at Scale

LightMesh now integrates with AWS, providing automated discovery and unified management of your cloud networking resources. This powerful integration eliminates fragmented visibility across VPCs and regions, giving you real-time insights into your entire AWS infrastructure through a single, intuitive platform.

Drift Away: The Hidden Risk of Large-Scale Kubernetes Environments

Configuration drift is a silent but persistent challenge in managing Kubernetes environments at scale. Whether you’re running workloads across multiple clusters in on-premises data centers, cloud providers, or edge locations, the risk of drift increases exponentially as environments grow. According to a Komodor survey, 40% of Kubernetes users report that configuration drift negatively impacts the stability of their environments.

Preventing Alert Storms with InfluxDB 3's Processing Engine Cache

A common problem in monitoring and alerting systems is not just alerting on what you’re seeing but preventing alert storms from overwhelming operators. When a system generates multiple notifications for the same incident, it leads to alert fatigue and can mask other important issues. For time series data, alert fatigue can result in missed anomalies, delayed responses to critical trends, and difficulty distinguishing real performance degradations from noise.
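One common way to keep repeated notifications for the same incident from reaching operators is to remember when each incident last alerted and suppress repeats within a cooldown window. The sketch below illustrates that idea in plain Python. It does not use the InfluxDB 3 Processing Engine API: the `AlertSuppressor` class, its in-memory dictionary (standing in for a cache), and the incident keys are all hypothetical.

```python
import time
from typing import Callable, Dict, Hashable


class AlertSuppressor:
    """Suppress repeat notifications for the same incident within a cooldown.

    Hypothetical helper: a plain dict stands in for whatever cache a real
    alerting pipeline would use.
    """

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_sent: Dict[Hashable, float] = {}

    def should_notify(self, incident_key: Hashable) -> bool:
        """Return True if a notification should be sent for this incident."""
        now = self._clock()
        last = self._last_sent.get(incident_key)
        if last is not None and now - last < self.cooldown_seconds:
            # Still inside the cooldown window: drop the duplicate.
            return False
        # First alert for this incident, or the cooldown has expired.
        self._last_sent[incident_key] = now
        return True
```

Suppressed duplicates do not extend the window here, so a long-running incident re-alerts once per cooldown period rather than going silent indefinitely. Whether that is the right trade-off depends on the alerting policy.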

Dashboard updates: Fewer clicks, more control, faster widget building

You're reviewing your production metrics when suddenly an error spike appears on your dashboard. Your immediate thought isn't "how do I build a new view to investigate this?" but rather "how do I find out the cause quickly?" This is exactly what happened to one of our engineering teams last month when they spotted an unusual pattern in their API response times. Instead of running ad-hoc queries from scratch, they turned to a custom dashboard they had built after a past incident.

AI On A Budget: Low-Cost Strategies For Running AI In The Cloud

AI costs can spiral out of control before you know it. One day you’re building an AI feature that promises to bring in a solid chunk of revenue for the company. The next day you’re obsessing over an astronomically high cloud bill that will significantly eat into your profits — or consume them entirely. To help you solve this problem, we brought in Jeremy Daly, Director of Research (and AI cost management guru) at CloudZero.