Operations | Monitoring | ITSM | DevOps | Cloud

Latest News

Solve financial services ITOps challenges with AIOps

The financial services industry is experiencing a profound shift. Customers now demand a flawless experience across all touchpoints, including online platforms, mobile devices, ATMs, and physical branches. Any lapse in performance or reliability in these channels can lead to dissatisfaction. Moreover, competition is intensifying as technology-focused companies, more nimble and innovative than their traditional counterparts, continue to disrupt the market.

DORA vs. DORA!

There was recently some confusion in the office that I thought was worth researching and addressing. Depending on who you are talking to, you may hear the acronym DORA in one of two contexts. (OK, three if you’re talking to a preschooler!) It might be in relation to DORA metrics, that is, a set of metrics associated with DevOps Research and Assessment.

Trade-off Between Reliability and Feature Velocity

The pressure to constantly innovate and release new features can often clash with the need for a stable and reliable product. While there might be some temporary cutbacks in testing time to achieve high feature velocity, ensuring reliability doesn't have to be an afterthought. We reached out to industry experts to gather their insights on ensuring reliability during phases that demand high feature velocity. Here's what they had to say.

Navigating the Evolving Landscape: A Deep Dive into REST API Versioning Strategies

In the ever-evolving landscape of APIs, ensuring seamless interactions and managing changes becomes crucial. While innovation and adaptability are essential, maintaining backward compatibility is equally important to avoid disruption for existing users. This is where REST API versioning comes into play. Versioning allows you to introduce new features or changes to your API in a controlled manner, while simultaneously keeping older versions running smoothly.

Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this blog, from the SRE perspective, are about time pressures (when to ship the investigation) and the type of report people expect.

Combating IT Alert Fatigue

With the growing complexity of IT systems, managing alerts and notifications without succumbing to the crippling effects of alert fatigue has never been more challenging. Alert fatigue occurs when the volume of notifications makes it impossible to discern signal from noise, desensitizing recipients to warnings, some of which turn out to signal critical issues.

Finally: alerting and on-call scheduling for how you actually work

TL;DR You deserve a better alerting and on-call tool. So we built Signals. In our early days, we often used the tagline, “You just got paged. Now what?” It encapsulated how FireHydrant solved for all of the messy bits that come after your alert is fired, from incident declaration all the way through to retrospective. At the time, we saw alerting and on-call scheduling as a solved problem.

Integrating Prometheus AlertManager with PagerDuty in Calico

In the fast-paced world of Kubernetes, guaranteeing optimal performance and reliability of the underlying infrastructure, such as container and Kubernetes networking, is crucial. One key to achieving this is effectively managing alerts and notifications. This blog post emphasizes the significance of configuring alerts in a Kubernetes environment, particularly for Calico Enterprise and Cloud, which provide Kubernetes workload networking, security, and observability.

Start Monitoring Third-Party Outages in Opsgenie

In today's digital world, we rely a lot on third-party services. These services are great because they help us grow, be more flexible, and work more efficiently. However, they also make things more complicated and risky. If a service we depend on stops working, it can cause big problems. To deal with this, we're excited to introduce a new feature that connects Opsgenie with IsDown.

Balancing Innovation and Reliability: A Guide for SRE Teams

In today's rapidly evolving technological landscape, striking a balance between innovation and reliability is a constant challenge for Site Reliability Engineering (SRE) teams. On one hand, businesses and customers crave the constant stream of new features and functionalities that fuel progress. On the other hand, ensuring system stability, minimal downtime, and optimal performance remains paramount for user experience and business continuity.