CheckMK and Enterprise Alert - a scripted heartbeat check

A few days ago I received an inquiry about a scripting problem from one of our longtime partners: our DCP Marc Handel from IT unlimited AG. While working through it with Marc, I realized that his idea of combining the Enterprise Alert Scripting Host, the Windows Task Scheduler, and CheckMK to implement round-trip monitoring could be interesting for the whole community, and especially for our CheckMK customers.

Building a fault-tolerant content API using Strapi and FlashDrive

Strapi is a very popular headless CMS, and you have probably heard very good things about it. Here at FlashDrive we love it and use it every day! On FlashDrive.io, Strapi is available as a one-click installation in the FlashDrive marketplace. In this tutorial, you will learn how to install Strapi on FlashDrive, create your first content, and publish it!

Creating problem-solving partnerships through a policy of open innovation

The world is full of problems. Any company trying to make a name for itself in the world is going to run right smack into those problems. But the world is also full of solutions. To better find and profit from those solutions, companies are increasingly embracing open innovation, an approach to solving problems in creative and unexpected ways by collaborating with customers, partners, and employees.

Monitoring Amazon CloudFront with Graphite via Graphite APIs

MetricFire offers complete system, infrastructure, and application monitoring using a suite of open-source monitoring tools. With MetricFire, you can monitor all your infrastructure on a single dashboard. The platform displays metrics on the dashboard using either Hosted Prometheus or Graphite-as-a-Service.

How Lowe's SRE reduced its mean time to recovery (MTTR) by over 80 percent

The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition.

The selling doesn't stop once the contract is signed

I have a long-time N-able partner whose account I managed off and on over the years. Although I am no longer in that role, we still keep in touch, chatting regularly about how their business is doing and discussing their successes or any challenges they might currently be facing. This year, they set some pretty aggressive growth targets for their organization. Their revenues were off due to the pandemic, so they needed to regroup and double down to make 2021 a more profitable year.

Introducing our open source SLO Tracker - A simple tool to track SLOs and Error Budget

One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open source. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker. We made it open source so that the SRE community can use it too. We look forward to getting your feedback, suggestions, and patches :)