Operations | Monitoring | ITSM | DevOps | Cloud

Incident Review - Google Outage

When something as ubiquitous as Google goes down, there is a lot of online frenzy with users tweeting and searching for updates on the issue. That’s exactly what we witnessed today between 9/24/2020 17:59:44 PST and 9/24/2020 18:23:20 PST. Multiple Google services, including Mail, Drive, Meet, and Hangouts, experienced downtime. Frustrated users took to Twitter to report the outage, and the tweets were captured by Websee. Users trying to access Google services got a 502 error screen.

Harnessing the Transformative Power of Disruption

They say that necessity is the mother of invention. I believe that the aphorism has a business corollary: Disruption is the mother of transformation. I’ve seen it prove out over and over again throughout my career. In fact, I’d even go one step further to say that lack of disruption can actually stand in the way of successful change—and I have the scars to prove it.

Leveraging logs to better secure cloud-native applications

With the growing popularity of cloud computing, security incidents related to it have been on the rise. Logs are indispensable resources for countering these threats, and they can be utilized for alerting, taking remedial action, and even preventing future attacks. In this post, we will examine ways to better secure cloud-native applications using logs.

Americaneagle.com and ROC Commerce stay ahead with Retrace

As a digital agency, the last thing you need is production issues for your ecommerce clients. The stakes are even higher when your ecommerce clients are running Super Bowl ads for millions to see. Instead of enjoying the game, you are faced with troubleshooting a dumpster fire. The development teams at Americaneagle.com and ROC Commerce rely heavily on Application Performance Management (APM) tools, especially on high-stakes game days.

Any PLC alarm on your mobile device

Maintenance of machines is an incredibly important task, and it is important to fix a machine before it fails completely. In reactive maintenance scenarios, speed of response is key. Once an issue is detected, it is important to communicate it as reliably and quickly as possible to the right engineer. Ideally, the machine is connected directly to the team of mobile engineers in charge and can let them know exactly what happened and what needs to be fixed.

The incident resolution mandate of telehealth and telepharmacy providers in the age of Covid-19

The incident management challenges of a pandemic-driven world, and how to overcome them. “While the safety and well-being of workers affected by COVID-19 is the first priority, companies will also triage other essentials, such as incident management and stakeholder communications.” (PWC) In a pandemic-stricken world that is consuming products and services over the internet more than ever, there is a great strain on digital and connectivity systems.

Best PagerDuty Alternatives of 2020: An Independent Review by StatusGator

Modern applications offer more and more features, and the infrastructure needed to run them becomes increasingly complex. The need for Application Performance Monitoring (APM) and Network Performance Monitoring (NPM) tools like PagerDuty is obvious, as the cost of downtime can be exorbitant for a business of any scope. Thus, every business needs to use PagerDuty or one of its alternatives to alert the Ops team should anything go awry.

How I'm using Grafana and Prometheus to monitor my 3D printing

My name is Jonathan Stines, and I am a Penetration Tester for Rapid7, a cybersecurity company located in Austin, Texas. A small handful of my former colleagues at Rapid7 now work at Grafana Labs and have said it was a pretty cool spot to have landed. I had a vague understanding of what Grafana was, but what really piqued my interest was seeing their sweet dashboards in the HBO series Silicon Valley.