Latest News

How Many SREs Does Your Company Need? Here's How to Decide

So you’ve decided to take advantage of Site Reliability Engineering by hiring SREs for your company. Now you have a second decision to make: exactly how many SREs to hire. Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs? The answers to these questions will, of course, vary; every business’s needs are different.

Automate Troubleshooting of Applications Running on Kubernetes

StackState is an out-of-the-box solution to observe your entire Kubernetes stack, identify problems, automatically highlight the changes that cause them and provide the full context you need for efficient and effective troubleshooting. Our clear and affordable pricing makes it easy to get started today.

Announcing issue-initiated Change Lead Time

Sleuth is pleased to announce a new option to start your Change Lead Time clock based on state transitions in your issue tracker! In our ongoing effort to meet customers where they are, we heard from many of you that you’d like Sleuth to account for and provide visibility into your pre-commit coding time. We’re pleased to offer this new option to tell Sleuth which specific state transitions in your issue tracker should start your Change Lead Time clock!

Share secrets with standalone projects with project context restrictions

Introducing project context restrictions for GitLab organizations. This feature enables project-based restrictions on contexts for standalone projects that are not tied to a VCS. Standalone projects are available at this time only with a GitLab integration with CircleCI. In this blog post, we hope to explain the value of this feature and how it can be used to further secure your workflows.

Comprehensive Guide on Partitioning and Sharding in Azure Database for PostgreSQL

One of the biggest mistakes I’ve had to repeatedly help companies fix has been poor partitioning design. I’ve seen many database architectures designed in an attempt to make queries faster. While faster queries can be a product of implementing partitioning correctly for a given design, I’ve often seen query response times get much slower from implementing partitioning incorrectly for the database design.

Collect traces, logs, and custom metrics from your Google Cloud Run services with Datadog

Google Cloud Run is a managed platform for the deployment, management, and scaling of workloads using serverless containers. You can deploy workloads in the cloud or, using Cloud Run for Anthos, on your on-prem infrastructure.

Why you should ditch your overly detailed incident response plan

When critical incidents happen — which they inevitably do 😅 — and you’re in the middle of trying to figure out what the best thing to do is, it can feel comforting to know that you’ve got a pre-prepared list of instructions to follow, commonly known as an “incident response plan”: In theory this sounds quite simple, and a typical flow you might envision is: It might be tempting to think that the hardest part of running incidents is finding or writing a checkl

How to Detect Anomalies and Why You Should Care

Companies today are relying on technology more than ever thanks to widespread digital transformation and cloud initiatives. And this is increasing the need for safe, efficient and reliable IT environments. But maintaining operational IT stability is very difficult when considering the complex and dynamic nature of today’s IT environments. In fact, IT environments are constantly changing, with new network devices, users and software versions coming into existence.