Latest News

How Puppet supports desktop and laptop automation in a changing world

The world has changed since I started out on a help desk in Colorado 25 years ago. Back then, a company’s desktop machines actually lived under the desks of many people in the organization (and often doubled as foot warmers!). Configuration was done manually, machine by machine, or, if we were lucky, by a script created to run at login. If business users had laptops at all, they were rarer and a lot less mobile than in today’s business world...

Podcast: Break Things on Purpose | Tomas Fedor, Head of Infrastructure at Productboard

Tomas Fedor, Head of Infrastructure at Productboard, is here to talk about his personal passions and professional perfections. Tomas takes us through some of the biggest adaptations he had to make when adopting the cloud. He also tackles the complexities of working through his POC process and how to keep consistency across various teams. Teams are a central focus for Tomas as well, and his techniques and experiences in growing and leading specific technical teams are insightful.

Play with the Speedscale - no registration required

For the first time, Kubernetes engineering teams interested in learning more about Speedscale will be able to play with the framework without registering, at play.speedscale.com. While users won’t be able to actively watch replays run, there are a variety of pre-created traffic snapshots, reports, and configs to browse. Engineers will be able to experience the ease with which snapshots are generated for fast, scalable test automation.

How to Write Meaningful Retrospectives

One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are 7 main elements to a retrospective. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions.

5 ways incidents made me a better engineer

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.

TensorFlow Python Code Injection: More eval() Woes

The JFrog security research team (formerly Vdoo) recently disclosed a code injection issue in one of the utilities shipped with TensorFlow, a popular machine learning platform that’s widely used in the industry. The issue has been assigned CVE-2021-41228. This disclosure comes hot on the heels of our similar disclosure in Yamale, which you can read about in our previous blog post.
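
To illustrate the general class of bug the headline refers to, here is a minimal Python sketch of why calling eval() on untrusted input is risky. It is not the TensorFlow utility's code, and the function names parse_with_eval and parse_with_literal_eval are hypothetical. It contrasts eval() with the standard library's ast.literal_eval(), which only accepts Python literals.

```python
import ast


def parse_with_eval(expr: str):
    """Hypothetical helper: UNSAFE, because eval() runs arbitrary Python code."""
    return eval(expr)


def parse_with_literal_eval(expr: str):
    """Hypothetical helper: accepts only literals (numbers, strings, lists, dicts, ...)."""
    return ast.literal_eval(expr)


benign = "[1, 2, 3]"
# Harmless stand-in for an injected payload: it only reads the working directory.
injected = "__import__('os').getcwd()"

# Both helpers handle a plain literal the same way.
print(parse_with_literal_eval(benign))

# eval() executes the injected call; literal_eval() refuses it.
print("eval() ran attacker-chosen code:", parse_with_eval(injected))
try:
    parse_with_literal_eval(injected)
except (ValueError, SyntaxError) as exc:
    print("literal_eval() rejected the input:", type(exc).__name__)
```

When an input is expected to be plain data, a literal-only parser like this is one common mitigation. Whether it fits a given tool depends on what expressions that tool needs to accept.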

Remote Server Management Guide: What Is It and How It Works

Remote server management is a proven strategy for increasing the uptime and responsiveness of your IT infrastructure. It involves managing the performance, health, and utilization of remote servers or back-end systems on various networks. After reading this post, you’ll understand what remote server management is, how it works, and how to implement it.

Terraform and Shipa 101 - Your First Terraform and Shipa Cloud Integration

Terraform, an infrastructure-as-code platform, is a great match for Shipa. Using the two technologies together is becoming more mature, and there have been some great pieces on the art of the possible between the two platforms. If you are unfamiliar with both, this guide will get you up and running with Terraform and Shipa together. In this example, we will be using Terraform to create all of the necessary Shipa resources to deploy to a Kubernetes cluster.

SRE Principles: The 7 Fundamental Rules

In one of our previous articles, we discussed what an SRE is, what they do, and some of the common responsibilities that a typical SRE may have, like supporting operations, dealing with trouble tickets and incident response, and general system monitoring and observability. In this article, we will take a deeper dive into the various SRE principles and guidelines that a site reliability engineer practices in their role.