Operations | Monitoring | ITSM | DevOps | Cloud

The customer service imperative: Digital operations and engagement

Without question, the largest disruption the customer service market has ever experienced happened at the beginning of the pandemic. Millions of customer service agents across the globe were sent home overnight, causing major disruptions for companies that were too reliant on manual processes and tribal knowledge. Agents didn’t have the necessary tools in this new remote environment, and customers experienced unprecedented wait times as a result, with some requests never being answered.

Speed up your dashboard workflow with dynamic template variable syntax

Template variables enable you to use tags to filter your Datadog dashboards to the hosts, containers, or services you need for faster troubleshooting. However, in some cases a standard set of template variables cannot aggregate all of the data you need without growing into a complicated, difficult-to-manage set of variables. For example, you may use tag values that are a subset of another tag.

eBonding Integration: ServiceNow Incidents to 5 Destinations: PagerDuty, Twilio, Slack, ElasticSearch/Kibana and Email

In this blog, we will walk through the scenario of sending, or e-bonding, ServiceNow incidents to five destinations simultaneously, using Robotic Data Automation and AIOps Studio. E-bonding refers to a scenario where data is delivered (one-way) or synchronized (two-way) between two or more systems, which are typically under different administrative boundaries. The term e-bonding originally appeared in the service provider and telco space (see: ATT E-Bonding).

Concrete Steps to Reducing MTTR

In today’s data-centric world, metrics or numbers define all performance benchmarks. The time between when an event starts and ends shows how well a system can handle and process such events. One such metric is MTTR. MTTR usually stands for Mean Time To Resolution, but it has held several meanings over the years. MTTR is a metric used to measure how well a system can bounce back from errors and provide long-lasting solutions.
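Taking MTTR in its Mean Time To Resolution sense, the metric is the average of the time between each incident's start and end. Here is a minimal sketch, assuming each incident is recorded as a (start, end) pair of timestamps; the function name and the sample incidents are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the average (end - start) duration over a list of incidents.

    `incidents` is a list of (start, end) datetime pairs (an assumed format).
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: three incidents lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 9, 30)),
    (datetime(2021, 6, 2, 14, 0), datetime(2021, 6, 2, 15, 30)),
    (datetime(2021, 6, 3, 8, 0), datetime(2021, 6, 3, 9, 0)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

The three durations add up to 180 minutes, so the mean is one hour. A team that defines MTTR differently (for example, as time to repair or time to respond) would change which timestamps go into each pair, not the averaging itself.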

Digging into AWS Fargate runtime security approaches: Beyond ptrace and LD_PRELOAD

Fargate offers a great value proposition to AWS users: forget about virtual machines and just provision containers. Amazon will take care of the underlying hosts, so you will be able to focus on writing software instead of maintaining and upgrading a fleet of Linux instances. Fargate brings many benefits to the table, including small maintenance overhead, lower attack surface, and granular pricing. However, as with any cloud asset, leaving your AWS Fargate tasks unattended can lead to nasty surprises.

No-code Lambda Monitoring

Auto-instrumenting Lambda Monitoring didn’t originate through a focus group or business plan. It started as a hackathon project in which our growth team used CloudWatch to build a prototype that could instrument Lambda functions with Sentry. We did this by using a CloudFormation stack to automatically create resources in a customer environment while streaming CloudWatch Logs to Sentry through Kinesis Firehose.

A guide for CTO: 8 questions to ask before using Kubernetes

Congratulations, you are finally considering moving your apps to Kubernetes. It is a big day! Here is a checklist to ensure you have not forgotten anything essential and to increase your chances of success with Kubernetes. We divided those points into three sections, from the most important to the least. Let’s go.

How shuffle sharding in Cortex leads to better scalability and more isolation for Prometheus

For many years, it has been possible to scale Cortex clusters to hundreds of replicas. The relatively simple Dynamo-style replication relies on quorum consistency for reads and writes. But as such, more than a single replica failure can lead to an outage for all tenants. Shuffle sharding solves that issue by automatically picking a random “replica set” for each tenant, allowing you to isolate tenants and reduce the chance of an outage.
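The idea of picking a random replica set per tenant can be sketched as follows. This is an illustrative sketch only, not Cortex's actual implementation: it assumes a flat list of replica names and seeds a random generator from a hash of the tenant ID so that the same tenant always maps to the same subset. The function name, replica names, and shard size are hypothetical.

```python
import hashlib
import random


def shuffle_shard(tenant_id, replicas, shard_size):
    """Pick a stable pseudo-random subset of replicas for one tenant."""
    if shard_size > len(replicas):
        raise ValueError("shard_size cannot exceed the number of replicas")
    # Derive a deterministic seed from the tenant ID so the selection is
    # repeatable across calls and processes.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(replicas, shard_size))


# Hypothetical cluster of 12 replicas, with each tenant limited to 3 of them.
replicas = [f"replica-{i}" for i in range(12)]
shard_a = shuffle_shard("tenant-a", replicas, 3)
shard_b = shuffle_shard("tenant-b", replicas, 3)
print(shard_a)
print(shard_b)
```

Because each tenant only touches its own subset, a failure of two replicas affects only the tenants whose shards contain both of them. With 12 replicas and shards of 3, there are 220 possible shards, so two tenants rarely share an identical set.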

Observability: It's the User Experience, Stupid!

Observability, which originated from control theory, measures how well you can understand a system’s internal states from its external outputs. Observability uses instrumentation to provide insights that aid monitoring. In DevOps, observability is achieved through a set of monitoring solutions. The shift to using one vendor platform to do so, versus multiple solutions, makes sense as...