Latest Posts

Preventing Outages in 2023

The outages span the giants of the Internet and some of the biggest failures of IT resilience we were subject to – from AWS’s trifecta of outages in December 2021 to the October 2021 outage that took down Facebook, Instagram, WhatsApp, and interrelated services. We also look at some more intermittent outages that you may have missed.

SRE Report 2023: Findings From the Field - Toil

Toil. Few other words have the same visceral impact for SREs as their four-letter nemesis: toil. Although pretty much everyone recognizes and agrees that toil is bad, the term is frequently misused in everyday conversation. In common English usage, toil is defined as “long strenuous fatiguing labor”. As a term of art in the SRE profession, “toil” has several very specific characteristics that distinguish it from other sorts of work people spend time on.

'Preventing Outages in 2023: What We Can Learn from Recent Failures' Provides Analysis of Internet Failures and Key Learnings

New white paper from Catchpoint provides in-depth analysis of key Internet outages across the past 18 months, from AWS to Facebook; includes six critical lessons for IT teams to improve Internet Resilience.

Microsoft Cloud Outage Causes Global Workforce Disruptions

Many of us (indeed, 1 billion plus users worldwide) rely on Microsoft for essential work activities and were impacted yesterday (Wednesday, January 25, 2023) when the cloud service provider experienced a prolonged outage. Internet Resilience is a business priority because when critical workforce services like Microsoft go down, global teams are hugely disrupted.

How Much Does That Minute Cost?

Network outages are both common and expensive – usually far more expensive than people realize. Yes, the network is down and the organization is losing money, but do you really appreciate how much money? And how much an outage can actually cost on a per-minute basis? Not only is it more than most people think, but it is also something that can be mitigated fairly easily.

Catchpoint Announces the World's First Complete Solution to Monitor and Protect the Internet's Leading Companies from BGP Incidents in Seconds

Catchpoint's Internet Performance Monitoring Platform helps IT teams identify and mitigate BGP incidents, including hijack attempts and routing issues, with the industry's broadest network of vantage points in the world drawing on real-time BGP monitoring.

SRE Report 2023: Are we Aligned? Yes. No. Maybe.

Each year of the SRE Report, there’s a trend or anti-pattern that leaps out and makes us pause and reflect. Last year, for example, we found a huge drop in global toil levels. With the whole world working from home for a full year, it made sense that global toil levels would drop, right? But this year, despite the great reopening underway, toil levels dropped even further. It's a paradox, one which no doubt will require its own scrutiny.

How Catchpoint's IPM Platform Detected Amazon's Two-Day Search Issue

Not all Internet outages take a website down. Some may impact a smaller subsection of users or only affect one part of a site’s functionality. Moreover, because of their relatively “hidden” nature, organizations may not always know about them immediately, since fewer users will complain. However, such incidents can still have serious consequences, so you want to detect them as soon as possible in order to mitigate and resolve issues quickly.

What is Internet Performance Monitoring and How is it Different from APM?

Most Internet-centric organizations today use some form of APM tools, as they should. But they are insufficient. Over the last ten years, the world has completely changed. If you think about it, in the first decade of this millennium, most businesses had an Exchange server, maybe Siebel CRM, a file share, and a range of other business apps, usually hosted in the same building. Everything was on the LAN. Today, it is the exact opposite. Everything is distributed.