Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Understanding and Baselining Network Behaviour using Machine Learning - Part I

Managing a network more effectively has been something our customers have been asking us about for many years, but it has become an increasingly important topic as working from home becomes the new normal across the globe. In this blog series, I thought I’d present a few analytical techniques that we have seen our customers deploy on their network data to: Better understand their network and Develop baselines for network behaviour and detect anomalies.

Understanding and Baselining Network Behaviour using Machine Learning - Part II

A difficult question we come across with many customers is ‘what does normal look like for my network?’. There are many reasons why monitoring for changes in network behaviour is important, with some great examples in this article - such as flagging potential security risks or predicting potential outages.

Colonel Mustard in the Library with Microservices APM

As many of us are rediscovering an interest in board games, it feels relevant to make reference to Hasbro’s classic Clue. Understanding what’s going right or wrong in your sprawling digital business can feel a lot like a murder mystery: it was the authentication service in the east region with the memory exhaustion error. This analogy has a weakness when applied to modern operations. The Clue board game had 6 weapons, 6 suspects, and 9 rooms. That’s 324 combinations.

Sharing Code Dependencies with AWS Lambda Layers

The use of Serverless execution models is expanding extremely rapidly and cloud providers are continuing to enhance their platforms. Per Flexera’s “State of the Cloud” report: Leading this trend for the last two years, Amazon has released a few features that address AWS Lambdas’ pain points and make them a more feasible choice for large scale deployments consisting of numerous applications.

Visualizing observability with Kibana: Event rates and rate of change in TSVB

When working with observability data, a good portion of it comes in as time series data — things like CPU or memory utilization, network transfer, even application trace data. And the Elastic Stack offers powerful tools within Kibana for time series analysis, including TSVB (formerly Time Series Visual Builder). In this blog post, I’m going to attempt to demystify rates in TSVB by walking through three different types: positive rates, rate of change, and event rates.

How to design your Elasticsearch data storage architecture for scale

Elasticsearch allows you to store, search, and analyze large amounts of structured and unstructured data. This speed, scale, and flexibility makes the Elastic Stack a powerful solution for a wide variety of use cases, like system observability, security (threat hunting and prevention), enterprise search, and more. Because of this flexibility, effectively architecting your deployment’s data storage for scale is incredibly important.

Modern ITSM Solutions: Flexibility in Incident Response

We no longer live in a world where a few tools determine the way organizations structure their processes. From IT Service Delivery to Incident Response, Modern IT Operation Solutions need to embody the flexibility that most Enterprises require. The dynamic ITOps ecosystem has shifted to put choice back in the hands of the user. Now, IT Solutions must follow suit. Modern Incident Response platforms, in particular, need the flexibility that enterprises need to mirror their enterprise architecture.

Why Serverless Apps Fail and How to Design Resilient Architectures

We’ve been monitoring 100,000’s of serverless backend components for 2+ years at Dashbird. In our experience, Serverless infrastructure failures boil down to: These isolated faults become causes of failure due to dependencies in our cloud architectures (ref. Difference of Fault vs. Failure). If a serverless Lambda function relies on a database that is under stress, the entire API may start returning 5XX errors.

Complexity Debt: The Hidden Wrench In Enterprise Modernization

There is a dark side to digital transformation, but nobody wants to talk about it. I previously wrote about technical debt. But there’s also complexity debt. When a CIO decides to delay modernizing or upgrading systems, there are usually budget considerations and skills gaps that stand in the way. The IT leader’s job is one of continual evaluation of risk and opportunity amid rapid technology disruption.

Advice for On-call Teams During COVID-19

I’ve offered some tips up for folks who are oncall during the COVID-19 crisis, but I thought it would be helpful to get some more ideas from people with different perspectives. So I reached out to some people I trust to see what they had to say. They all have different viewpoints, but some themes emerge, like managing alerts, having empathy, and practicing self-care. The participants, in alphabetical order: Aaron Aldrich is a Developer Advocate at LaunchDarkly, with a focus on DevOps.