Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

FAANG proofing your Job Applications

There is one thing that hurts more than being rejected by a hiring manager - being rejected because you’re not ex-FAANG. This was not always the case though - FAANG’s combined engineering workforce is currently at 330,000+ and growing at an astounding 20% YoY. This means that at any given point in time, there are tens of thousands of FAANG engineers active in the job market vying for spots in great up-and-coming companies.

Escalating Prometheus alerts to SMS/Phone/Slack/Microsoft-Teams via AlertManager and Zenduty

Prometheus is by far one of the most popular open-source monitoring tools, used by millions of engineering teams globally, with a robust community and continued adoption and evolution. We at Zenduty shipped our Prometheus integration a while back, and we’re happy to report that its adoption has been absolutely through the roof!

Site reliability engineering - what is SRE?

As companies race to build site reliability engineering (SRE) practices within their engineering teams, site reliability engineering has become one of the hottest and highest-paying jobs in tech. The term was coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure and functional.

Difference between a team lead and an engineering manager and how to transition between these roles

Transitioning from a team lead role to an engineering manager role is tough, and you will experience many changes along the way. What happens when you become an engineering manager?

The difference between Event Logging and Tracing in Observability

I have noticed that a lot of folks confuse event logging with tracing. When building out a generic SDK for devs to report observability data, should Event APIs be distinct from Trace APIs? Is an Event just a single Trace Span? If you look at Honeycomb’s implementation, an Event seems to be equivalent to a single-span trace: the middleware wrapper creates a Honeycomb event in the request context as a span in the overall trace.
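To make the event-versus-span question concrete, here is a minimal Python sketch of one way to model it. It is not Honeycomb's API: the `Span` class and the `new_event`, `child_span`, and `is_single_span_trace` helpers are hypothetical names invented for illustration. The assumption is simply that an event is a root span sitting alone in its own trace.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One timed unit of work. Hypothetical model, not a vendor API."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None  # None marks the root span of a trace
    start_time: float = field(default_factory=time.time)
    duration_ms: float = 0.0
    fields: Dict[str, Any] = field(default_factory=dict)


def new_event(name: str, **fields: Any) -> Span:
    """Model a standalone event as a root span in a fresh, one-span trace."""
    return Span(name=name, trace_id=uuid.uuid4().hex, fields=dict(fields))


def child_span(parent: Span, name: str, **fields: Any) -> Span:
    """Create a span that joins the parent's trace instead of starting one."""
    return Span(name=name, trace_id=parent.trace_id,
                parent_id=parent.span_id, fields=dict(fields))


def is_single_span_trace(spans: List[Span]) -> bool:
    """True when the trace holds exactly one root span, i.e. an 'event'."""
    return len(spans) == 1 and spans[0].parent_id is None


# The request-level event is the root span; later work hangs off it.
request = new_event("GET /checkout", status=200)
db_call = child_span(request, "SELECT cart", rows=3)
```

Under this model a single Span type could serve both an Event API and a Trace API: an event is the degenerate case where nothing ever attaches a child to the root span.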

Attaching incident playbooks to Azure Monitor alerts for rapid remediation

Incident response playbooks are a set of actions that need to be executed by your incident responders depending on the nature of the outage. Having well-defined incident response playbooks can be critical, especially during high-customer-impact events that you would typically classify as Sev-0 incidents.

On-call compensation models

Providing customers with a world-class and seamless user experience is critical for the success of any business. It is therefore important that you have a robust on-call strategy that optimizes the availability of the right subject matter experts, on-call engineers, and support engineers to resolve critical, user-impacting incidents as soon as possible.

It's a known issue - How Product Managers should deal with issue or feature related enquiries or feedback

I often hear folks in my network being triggered by interactions with product managers within their companies whenever they follow up on certain product-related issues. The triggering phrase invariably is “It’s a known issue”. And they often wonder, well if it’s a known issue, why on earth isn’t anything done about it?

How to build a customer advisory board

Regardless of where you are in your product journey, it is imperative that you constitute a customer advisory board that can share perspectives on its members' business challenges, so that you can gain insights on how to shape your roadmap, develop new features, and formulate your vision, and so that you get constant feedback on your product. So, how many customers should you include in a customer advisory board? Should you target higher-level stakeholders or individual users?

Defining your Sev-1s

One of the primary things you need to figure out when your team is formulating its incident management process is how to describe in words what a Sev0 (your highest incident priority) looks like. “Website doesn’t work” is certainly not enough. “Website is up but a key resource (e.g., a CSS file) is missing, rendering the website unusable” is still not enough. “A single page on the website is 404’ing” is not a major incident but could be a minor one.