Logistics & supply chain · 2024 · 8 weeks

Logistics startup: incident response

Illustrative composite · no named client · metrics typical of the patterns we see in this kind of work

Incident workflows wired around customer-facing reliability targets, automatic playbook triggering, and incident grouping that points at the source. Median time-to-resolution dropped from 32 minutes to 9 minutes across a 200-service fleet.

Problem

A late-Series-A logistics startup ran a 200-microservice Go fleet with on-call distributed across product engineers. Median time-to-resolution (MTTR) was 32 minutes, and on-call burnout had become a retention issue.

Most incidents were misclassified. Engineers spent 15+ minutes ruling out infrastructure causes for incidents that were actually downstream service failures. Runbooks existed, but they were stale and distributed as PDFs.

Approach

OpenTelemetry adoption was already underway in the most critical services; we extended it across the fleet and built a service-dependency graph from trace data that drove the incident-response surface.
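As a rough illustration of how a dependency graph can be derived from trace data, the sketch below joins each span to its parent and counts calls that cross a service boundary. The `Span` struct is a simplified, hypothetical stand-in for exported span data, not the OpenTelemetry SDK type, and the service names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a simplified, hypothetical stand-in for an exported trace span.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// Edge is a caller -> callee relationship between two services.
type Edge struct {
	Caller, Callee string
}

// BuildGraph counts cross-service calls by joining each span to its parent.
func BuildGraph(spans []Span) map[Edge]int {
	type key struct{ trace, span string }
	owner := make(map[key]string, len(spans))
	for _, s := range spans {
		owner[key{s.TraceID, s.SpanID}] = s.Service
	}
	graph := make(map[Edge]int)
	for _, s := range spans {
		if s.ParentID == "" {
			continue // root span has no caller
		}
		parent, ok := owner[key{s.TraceID, s.ParentID}]
		if !ok || parent == s.Service {
			continue // parent not seen, or an in-process child span
		}
		graph[Edge{Caller: parent, Callee: s.Service}]++
	}
	return graph
}

// Downstream lists the services that caller calls directly, sorted by name.
func Downstream(graph map[Edge]int, caller string) []string {
	var out []string
	for e := range graph {
		if e.Caller == caller {
			out = append(out, e.Callee)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Service: "checkout"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "pricing"},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Service: "routing"},
		{TraceID: "t1", SpanID: "d", ParentID: "c", Service: "routing"},
		{TraceID: "t2", SpanID: "a", Service: "checkout"},
		{TraceID: "t2", SpanID: "b", ParentID: "a", Service: "routing"},
	}
	g := BuildGraph(spans)
	fmt.Println(Downstream(g, "checkout"))
	fmt.Println(g[Edge{Caller: "checkout", Callee: "routing"}])
}
```

Keying the lookup on trace ID plus span ID keeps spans from different traces apart even when span IDs repeat, and the edge counts give a crude weight for how often one service depends on another.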

We wrote runbooks as executable code: Markdown with embedded Grafana queries, rendered in the on-call UI when an alert fires. Stale runbooks fail CI.

We trained an internal classification model on three months of incident history to suggest the service of origin within 30 seconds of an alert firing. Each suggestion carries a confidence score; humans always make the call.
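The advisory nature of the suggestion can be sketched as follows: candidates are ranked by confidence, and the top one is only highlighted when it clears a threshold, leaving the decision with the on-call engineer. The `Candidate` and `Suggestion` types, the 0.6 threshold, and the scores are hypothetical; the sketch assumes the classifier already emits a confidence in [0, 1] per service and says nothing about how the model itself works.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is one service the model considers a possible origin.
type Candidate struct {
	Service    string
	Confidence float64 // assumed to be in [0, 1], produced by the classifier
}

// Suggestion is what an on-call UI could display. It is advisory only.
type Suggestion struct {
	Ranked    []Candidate // highest confidence first, at most k entries
	Highlight bool        // true when the top candidate clears the threshold
}

// Suggest ranks candidates and flags whether the best one is confident
// enough to highlight. It never picks an owner on its own.
func Suggest(cands []Candidate, threshold float64, k int) Suggestion {
	if k < 0 {
		k = 0
	}
	ranked := append([]Candidate(nil), cands...) // copy; leave input untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Confidence > ranked[j].Confidence
	})
	if k < len(ranked) {
		ranked = ranked[:k]
	}
	return Suggestion{
		Ranked:    ranked,
		Highlight: len(ranked) > 0 && ranked[0].Confidence >= threshold,
	}
}

func main() {
	s := Suggest([]Candidate{
		{Service: "routing", Confidence: 0.21},
		{Service: "pricing", Confidence: 0.72},
		{Service: "checkout", Confidence: 0.07},
	}, 0.6, 2)
	for _, c := range s.Ranked {
		fmt.Printf("%s %.2f\n", c.Service, c.Confidence)
	}
	fmt.Println("highlight top suggestion:", s.Highlight)
}
```

Showing the runner-up candidates alongside the top one gives the engineer something to check against when the model is unsure.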

Outcome

70% MTTR drop
Incident response

Tech

Prometheus · Grafana · OpenTelemetry · PagerDuty

"We stopped paging product engineers for things that weren't theirs. The on-call satisfaction survey moved from 2.4 to 4.1 in two months."
