Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Is your online gaming platform "Chaos Monkey"-proof?

Try to imagine a bunch of monkeys running around your data center, pulling cables, trashing routers and wreaking havoc on your applications and infrastructure. Ever more crucial in these days of heated competition between online gaming operators, is player experience. Continuity of operations is “Uber-Alles” and avoiding churn, due to service disruption, is the organizational mantra.

Grubhub and JPMC Shift Reliability Testing Left at Chaos Conf 2020

Get started with Gremlin's Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Gremlin’s Chaos Conf is always an exciting event, bringing together leaders at the forefront of Chaos Engineering practices. This year was no exception, moving beyond defining Chaos Engineering to more advanced adoption and best practices discussions.

ObservabilityCON Day 4 recap: a panel discussion on observability (and its future), the benefits of Chaos Engineering, and an observability demo showcase

Over the past four days, Grafana Labs' ObservabilityCON 2020 brought together the Grafana community for talks dedicated to observability. We hope you enjoyed all of the sessions, which are available on demand now. (Link to them from the schedule on the event page). The conference wrapped up with predictions and advice from observability experts, lessons in failure, and Grafana Labs team members showcasing ways Grafana and other tools fit into an observability workflow.

Looking back on Chaos Conf 2020

It’s already been a week since we closed our third annual Chaos Conf! While we were forced to take the conference online, this meant that more of you could join us. Over 3,500 people signed up to help make this the world’s largest Chaos Engineering conference. That’s 5x more than 2019, and nearly 10x more than 2018! This is a testament to the growth of Chaos Engineering as a practice across many different industries and around the world.

Is your microservice a distributed monolith?

Your team has decided to migrate your monolithic application to a microservices architecture. You’ve modularized your business logic, containerized your codebase, allowed your developers to do polyglot programming, replaced function calls with API calls, built a Kubernetes environment, and fine-tuned your deployment strategy. But soon after hitting deploy, you start noticing problems.

Technology Business Management and Chaos Engineering

Get started with Gremlin’s Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Technology Business Management (TBM) is a decision-making tool that helps organizations maximize the business value of information technology (IT) spending by adjusting management practices. With TBM, IT is transformed to run like a business instead of merely a cost center.

Understanding your application's critical path

Don’t wait for an incident to focus on reliability. Learn concrete steps for preventing incidents in the first place in our two-part series, Planning and Architecting for Reliability. It’s 3 a.m. You’re lying comfortably in bed when suddenly your phone starts screeching. It’s an automated high-severity alert telling you that your company’s web application is down. Exhausted, you open the website on your phone and do some basic tests.

Client-side chaos: Making your front end more reliable

Get started with Gremlin’s Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. The concept of Chaos Engineering is most often applied to backend systems, but for teams building websites and web applications, this is only half of the story.

Announcing Shared Scenarios to Promote a Culture of Reliability

Get started with Gremlin’s Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Today, Gremlin is excited to announce the ability to share a Scenario across your entire organization. This allows you to build up a library of reliability exercises that are customized to your company’s applications and technology.

What your company can learn from the Bank of England's resilience proposal

Learn how to modernize your financial systems with confidence while mitigating risk (and meeting compliance). This article was originally published on TechCrunch. The outages at RBS, TSB, and Visa left millions of people unable to deposit their paychecks, pay their bills, acquire new loans, and more.