Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Security Log Monitoring and DNS Request Analysis

Monitoring all DNS requests in your network, including those that were blocked (e.g., by a firewall), is a great way to increase visibility, enforce compliance and detect threats. A common problem with collecting DNS logs is that DNS server logs are notoriously hard to parse. Also, parsing only the logs of your own DNS servers leaves a blind spot: it misses any use, or attempted use, of an external DNS server such as Google's 8.8.8.8.
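As a rough illustration of closing that blind spot, the sketch below scans connection records for DNS traffic sent to resolvers other than the approved ones. The record fields (`src_ip`, `dst_ip`, `dst_port`, `action`), the sample data and the approved resolver address are all hypothetical; real firewall or flow logs will look different and need their own parsing.

```python
import ipaddress

# Hypothetical list of resolvers that clients are allowed to use.
APPROVED_RESOLVERS = {ipaddress.ip_address("10.0.0.53")}

DNS_PORT = 53


def external_dns_requests(records, approved=APPROVED_RESOLVERS):
    """Return DNS records whose destination is not an approved resolver.

    Blocked requests are kept on purpose: an attempt to reach an
    external resolver is worth seeing even if the firewall stopped it.
    """
    flagged = []
    for record in records:
        if record["dst_port"] != DNS_PORT:
            continue  # not DNS traffic
        if ipaddress.ip_address(record["dst_ip"]) not in approved:
            flagged.append(record)
    return flagged


# Made-up sample records standing in for parsed firewall log lines.
sample_records = [
    {"src_ip": "10.0.1.15", "dst_ip": "10.0.0.53", "dst_port": 53, "action": "allow"},
    {"src_ip": "10.0.1.22", "dst_ip": "8.8.8.8", "dst_port": 53, "action": "block"},
    {"src_ip": "10.0.1.22", "dst_ip": "93.184.216.34", "dst_port": 443, "action": "allow"},
    {"src_ip": "10.0.1.40", "dst_ip": "1.1.1.1", "dst_port": 53, "action": "allow"},
]

flagged = external_dns_requests(sample_records)
for record in flagged:
    print(record["src_ip"], "->", record["dst_ip"], record["action"])
```

This only covers classic DNS on port 53; DNS over TLS or HTTPS would need a different detection approach.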

SNMP vs WMI: the advantage of less resource consuming monitor types

WMI (Windows Management Instrumentation) is a de facto standard for accessing and controlling Windows components, services and applications. With its query language (resembling the SQL used by many relational databases), WMI allows collecting information from multiple sources, so-called providers. However, this comes at a cost: running a WMI query is a resource- and time-consuming operation compared to certain alternatives.

What is Subkeying?

Subkeying is a way to group a set of crashes at some level other than the top level of the call stack. At BugSplat, crashes are grouped by a stack key and groups of crashes can be found on the Summary page. By default, BugSplat groups crashes using the topmost level of a call stack. A subkey is created when crashes are grouped at a level other than the top level of a call stack.
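The idea of grouping by a frame below the top of the stack can be sketched in a few lines. This is a generic illustration, not BugSplat's implementation: the crash IDs, function names and the `depth` parameter are invented for the example, and each call stack is assumed to be a non-empty list ordered from the top frame down.

```python
from collections import defaultdict


def stack_key(frames, depth=0):
    """Pick the frame used as the grouping key; depth 0 is the top of the stack."""
    # If the stack is shorter than the requested depth, use its last frame.
    return frames[min(depth, len(frames) - 1)]


def group_crashes(crashes, depth=0):
    """Group crash IDs by the frame found at the given stack depth."""
    groups = defaultdict(list)
    for crash_id, frames in crashes.items():
        groups[stack_key(frames, depth)].append(crash_id)
    return dict(groups)


# Hypothetical crashes: each call stack lists frames from the top down.
crashes = {
    "crash-1": ["memcpy", "render_frame", "main"],
    "crash-2": ["memcpy", "load_texture", "main"],
    "crash-3": ["strlen", "load_texture", "main"],
}

by_top = group_crashes(crashes)              # default: topmost frame
by_subkey = group_crashes(crashes, depth=1)  # subkey: one level below the top

print(by_top)
print(by_subkey)
```

Grouping by the top frame puts crash-1 and crash-2 together under `memcpy`, while grouping one level down puts crash-2 and crash-3 together under `load_texture`, which may point more directly at the faulty caller.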

Introducing the Datadog Operator for Kubernetes and OpenShift

As more environments run on Kubernetes, including our own, Datadog has been making it easier to get visibility into clusters of any scale. To minimize load on the Kubernetes API server, the Datadog Agent runs in two different modes. The node-based Agent queries local containers or external endpoints for data, while the Cluster Agent fetches cluster-level metadata from the API server.

How to identify and resolve front-end performance bottlenecks

We all want lightning-fast websites and applications, but how do we prioritize our efforts in order to have the biggest impact on performance? We interviewed our own front-end team so we could share some best practices we use every day to improve and maintain the performance of Raygun.

Redis monitoring 101: Metrics to watch

Redis, which stands for Remote Dictionary Server, is an open source, in-memory data structure store that’s used as a database, memory cache, and message broker. It stores data entirely in memory in the form of key-value pairs. This gives it a speed advantage over disk-based databases, as it eliminates the need to read data from disk. It also makes Redis one of the fastest NoSQL databases, with data accessed in microseconds because there are no seek time delays.

Grafana 7.0 preview: New image renderer plugin to replace PhantomJS

Many Grafana users export images of their dashboard panels. This feature powers the ability to receive alerts with a rendered image of the panel attached, which is valuable for quickly spotting if something is about to go sideways in production. Since Grafana v2.0, when support for server-side rendering of dashboard panels as images was introduced, PhantomJS has served as the built-in image renderer that enables this feature.

How to deploy an app to AWS: App security

AWS security is an ongoing battle that you must address during every release, every change, and every CVE. When you’re first launching your production application, it’s impossible to check all the boxes; you simply don’t have the time. Until your application gets more adoption, you only have the time to do the bare essentials of security.

Sysdig's Prometheus monitoring behind the scenes

A few weeks ago, we announced that Sysdig is offering fully compatible Prometheus monitoring at scale for our customers, as well as a new website called PromCat.io hosting a curated repository of Prometheus exporters, dashboards and alerts. This got me thinking about how we were actually able to implement the changes necessary to offer this in our platform.

Best Practices in Incident Management

In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime. Incident Management processes are typically used by SRE, DevOps, NOC and other IT teams to respond to incidents that affect services and work on restoring their uptime.