Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Sponsored Post

What are Runbooks? And why are they needed?

Imagine being an Ops engineer in a team just struck by tragedy. Alarms start ringing, and incident response is in full force. It may sound like the situation is in control. WRONG! There's panic everywhere. The on-call team is scrambling for the heavenly door to redemption. But, the only thing that doesn't stop - Stakeholder Inquiries. This situation is bad. But it could be worse. Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don't have sufficient guidance, let alone moral support- you're toast.

What are Canary Deployments and Why are they Important?

Every modification to software comes with the potential for production problems. Application failures often have serious consequences which can result in a loss of revenue and a poor customer experience. Additionally, organizations constantly try to improve their services for a better customer experience. How can you minimize the chance of error and update your application with confidence?

Uptime + Squadcast Integration: Routing Alerts Made Easy

Uptime is a site monitoring solution used to reach various endpoints & notify users via push notifications when downtime is detected. It collects and stores downtime & response time data & which is then made available as reports to the users. If you use Uptime for your monitoring needs, you can now integrate it with Squadcast to route detailed alerts from Uptime to the right users in Squadcast. The below steps will help you set up Uptime and Squadcast integration.

Anti-patterns in Incident Response that you should unlearn

It is important to invest time and effort in understanding why a system performs the way it does and how we can improve it. Companies continue with practices that yield successful results, but ignoring anti-patterns can be far worse than choosing rigid processes. In this blog we will explore anti-patterns in incident response and why you should unlearn those.
Sponsored Post

Classifying Severity Levels for Your Organization

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity level also allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and since your platform is inoperative revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog we look at severity levels and how they can improve your incident response process.

Amazon OpenSearch + Squadcast Integration: Routing Alerts Made Easy

Developers often find comfort in embracing open-source software for numerous reasons. One of the most important reasons is the freedom to use that software anywhere and how they wish to. Amazon OpenSearch is an open-source search and analytics suite derived from Elasticsearch. It lets you perform interactive log analytics and real-time application monitoring with ease.
Sponsored Post

Top Five Pitfalls of On-Call Scheduling

On-call schedules ensure that there's someone available day and night to fix or escalate any issues that arise. Using an on-call schedule helps keep things running smoothly. These on-call workers can be anyone from nurses and doctors required to respond to emergencies to IT and software engineering staff who need to fix service outages or significant bugs. Being on-call can be challenging and stressful. But with the proper practices in place, on-call schedules can fit well into an employee's work-life balance while still meeting the organization's needs.

Distributed Caching on Cloud

Distributed caching is an important aspect of cloud based applications, be it for on-premises, public or hybrid cloud environments. It facilitates incremental scaling, allowing the cache to grow and incorporate the data growth. In this blog we will explore distributed caching on cloud and why it is useful for environments with high data volume and load.

Kubernetes as a Service using Amazon EKS

Kubernetes is an open source software that helps you manage and deploy your containerized applications. Kubernetes consists of two major component control planes (which decide where to run your pod) and a worker node (where your workload runs). As Kubernetes is a complex system, managing these components is challenging, and this is where you can use a Kubernetes as a Service solution like Amazon Elastic Kubernetes Service (EKS).

Setting up Route 53 Health Checks

We live in an age where the internet and digital data drive modern day markets, which results in huge amounts of data being generated and consumed. Hence, it has become very important for online platforms to manage this traffic and serve their customers more efficiently. In this blog we will explore the Amazon Route 53 service and see how it addresses domain name system routing and health check problems.