
Troubleshoot faster with process-level app and network data

When responding to an incident, you need to quickly find the scope of the issue so you know which teams to notify and which parts of your system to investigate next—before your end users are affected. But as multiple processes use resources on each of your hosts, and interact in unexpected ways, it can be difficult to know exactly what is causing an issue—especially if those processes are running off-the-shelf software.

PD Summit21: Transforming Infrastructure Teams Through Observability

What is this "observability" thing that everyone is talking about? Observability allows you to navigate the dark unknowns with echolocation while others attempt to fly blindly without it. Are your dashboards all green, but you still have an issue brewing? Do you need instant feedback based on the Core Analysis loop? Are your engineers tired of waking up at 3 AM for the expected issues? Is there a lack of time for experimentation? Generate your own answers and create a meaningful course of action with observability.

PD Summit21: The Netflix Reliability Story: A Brief History of How We Evolved Resilience to Failure

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. We strive to provide a service that people love and can enjoy anytime, anywhere. An important foundation for bringing our customers joy is a strong focus on reliability that ensures Netflix will be available when they need it. In this talk, I’ll tell the story of how we've grown our reliability practices over time to meet the changing demands of microservices and distributed computing.

PD Summit21: Adopting and Maturing to Service Ownership with PagerDuty and Rundeck

Among the common goals of today's engineering and operations teams is to adopt a culture of service ownership: "You build it, you own it." As with many objectives that support driving DevOps across an organization, this is easier said than done. Sometimes this is in small part due to the technology stack or architecture of a given company. But more often than not, it is because teams lack the human-to-technology mechanisms that allow for a culture of service ownership.

eG Enterprise, the virtual assistant that every Citrix Admin needs

eG Enterprise is the virtual assistant that will make your life a whole lot easier. Just like Siri and Alexa, eG will proactively monitor your IT and applications. Wouldn’t you want to know what this extra set of hands can deliver? Watch this short video to see how automatic root-cause diagnosis, Citrix service topology views, synthetic and real user monitoring capabilities, and machine learning with auto-baselining enable you to be the IT hero among your peers, colleagues, and management.

Creating Envoy WebAssembly Extensions

In the CNCF ecosystem, Envoy, an open source service proxy developed by Lyft, is a very common choice in service mesh networking. In a previous post we discussed that both Consul and Istio leverage Envoy. Were you aware that you can extend Envoy’s capabilities with WebAssembly? What is WebAssembly? WebAssembly, or Wasm as it is often abbreviated, is not so much a programming language as a specification for a binary instruction format that can be run in sandboxed virtual machines.
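To make "binary instruction format" a little more concrete, here is a minimal sketch in Python that reads the 8-byte preamble every Wasm binary starts with: the magic bytes `\0asm` followed by a little-endian 32-bit version number. The names `parse_wasm_header` and `EMPTY_MODULE` are illustrative assumptions, not part of Envoy or of any Wasm toolchain.

```python
import struct

# Every WebAssembly binary begins with these four magic bytes.
WASM_MAGIC = b"\x00asm"


def parse_wasm_header(data: bytes) -> int:
    """Return the version field from a Wasm binary's 8-byte preamble.

    Hypothetical helper for illustration; raises ValueError if the
    data does not look like a Wasm binary.
    """
    if len(data) < 8:
        raise ValueError("too short to contain a Wasm preamble")
    if data[:4] != WASM_MAGIC:
        raise ValueError("missing Wasm magic bytes")
    # The version is stored as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", data[4:8])
    return version


# The smallest valid module: just the preamble, with no sections.
EMPTY_MODULE = WASM_MAGIC + struct.pack("<I", 1)

if __name__ == "__main__":
    print(parse_wasm_header(EMPTY_MODULE))
```

A compiled Envoy extension would carry the same preamble, followed by the sections that hold its types, functions, and code.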

How to encourage DBAs to embrace DevOps, rather than fear change

How do we help Database Administrators (DBAs) embrace DevOps in a way that can be really productive and part of a rich DevOps team that delivers value to customers quickly and continuously? That’s an important question to ask right now because there’s a common view among DBAs that DevOps isn’t for them. They’re responsible for documentation and maintenance and deployments, they have internal customers, and they serve internal requests.

5 things you can do to improve your customer support (part 2)

Continuing from my previous blog, this post completes the list of five things you can do to improve your technical service delivery to your customers (if you didn’t read the last post, catch up on it first). In the following three points, I focus on the role automation can play.

Introduction to Custom Metrics in Python with the Logz.io RemoteWrite SDK

We just announced the creation of a new RemoteWrite SDK to support custom metrics from applications using several different languages. This tutorial will give a quick rundown of how to use the Python SDK. Using these integrations, Prometheus users can send metrics directly to Logz.io using the RemoteWrite protocol without sending them to Prometheus first. Each SDK targets a different language, and each is capable of working with frameworks like Thanos, Cortex, and of course M3DB.