Logistics startup: incident response
Illustrative composite · no named client · metrics typical of patterns we work in
Incident workflows wired around customer-facing reliability targets, automatic playbook triggering, and incident grouping that points at the source. Median time-to-resolution dropped from 32 minutes to 9 minutes across a 200-service fleet.
Problem
A late-Series-A logistics startup ran a 200-microservice Go fleet with on-call distributed across product engineers. MTTR was averaging 32 minutes; on-call burnout had become a retention issue.
Most incidents were misclassified. Engineers spent 15+ minutes ruling out infrastructure causes for what turned out to be downstream service failures. Runbooks existed but were stale and distributed as PDFs.
Approach
OpenTelemetry adoption was already underway in the most critical services; we extended it across the fleet and built a service-dependency graph from trace data that drove the incident-response surface.
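As a rough illustration of how a dependency graph can fall out of trace data, the sketch below counts cross-service parent/child span pairs. The flat span records and their field names (`span_id`, `parent_id`, `service`) are a simplified, hypothetical shape chosen for the example, not the OpenTelemetry wire format or the pipeline used in this engagement.

```python
from collections import defaultdict


def build_dependency_graph(spans):
    """Return {caller_service: {callee_service: call_count}} from flat span records.

    Assumes span IDs are unique across the input (hypothetical simplification).
    """
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    graph = defaultdict(lambda: defaultdict(int))
    for s in spans:
        parent_service = service_by_span.get(s.get("parent_id"))
        # Skip root spans, orphaned spans, and in-process child spans.
        if parent_service is None or parent_service == s["service"]:
            continue
        graph[parent_service][s["service"]] += 1
    return {caller: dict(callees) for caller, callees in graph.items()}


# Hypothetical spans from one request: gateway -> orders -> inventory (twice).
spans = [
    {"span_id": "a", "parent_id": None, "service": "api-gateway"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "orders"},
    {"span_id": "d", "parent_id": "c", "service": "inventory"},
    {"span_id": "e", "parent_id": "b", "service": "inventory"},
]
graph = build_dependency_graph(spans)
```

Edges point from caller to callee, so when a service alerts, its outgoing edges list the downstream services worth checking first.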
We rewrote runbooks as executable code: Markdown with embedded Grafana queries, rendered in the on-call UI when an alert fires. Stale runbooks fail CI.
We trained an internal classification model on three months of incident history to suggest the service of origin within 30 seconds of an alert firing. Suggestions are confidence-scored; humans always make the call.
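One way a CI job could flag stale runbooks is sketched below. What counts as "stale" is an assumption here: the runbook names a service that no longer exists, or its last review is older than a cutoff. The `service:` and `reviewed:` header lines, the 90-day default, and the function name are all hypothetical.

```python
import re
from datetime import date

# Hypothetical header lines at the top of each Markdown runbook.
HEADER = re.compile(r"^(service|reviewed):[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def runbook_problems(text, known_services, today, max_age_days=90):
    """Return a list of reasons the runbook should fail CI (empty if it passes)."""
    fields = dict(HEADER.findall(text))
    problems = []

    service = fields.get("service")
    if service is None:
        problems.append("missing service header")
    elif service not in known_services:
        problems.append(f"unknown service: {service}")

    reviewed = fields.get("reviewed")
    if reviewed is None:
        problems.append("missing reviewed header")
    else:
        try:
            age = (today - date.fromisoformat(reviewed)).days
        except ValueError:
            problems.append(f"invalid reviewed date: {reviewed}")
        else:
            if age > max_age_days:
                problems.append(f"last reviewed {age} days ago")
    return problems


known = {"orders", "inventory"}
today = date(2024, 6, 1)
fresh = "service: orders\nreviewed: 2024-05-01\n\n# Orders latency\n"
stale = "service: legacy-billing\nreviewed: 2023-01-15\n\n# Billing errors\n"
fresh_problems = runbook_problems(fresh, known, today)
stale_problems = runbook_problems(stale, known, today)
```

A CI step would run this over every runbook file and exit non-zero if any list is non-empty. The set of known services could come from the trace-derived dependency graph.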
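The confidence-gated suggestion can be illustrated with a deliberately simple stand-in. The sketch below is a frequency baseline over past incidents, not the trained model described above; the `(alerting_service, origin_service)` history format, the 0.6 threshold, and the function names are assumptions for the example.

```python
from collections import Counter, defaultdict


def train(history):
    """Count, per alerting service, how often each service was the real origin."""
    counts = defaultdict(Counter)
    for alerting_service, origin_service in history:
        counts[alerting_service][origin_service] += 1
    return counts


def suggest_origin(counts, alerting_service, min_confidence=0.6):
    """Return (origin, confidence), or None when there is no confident suggestion."""
    seen = counts.get(alerting_service)
    if not seen:
        return None  # No history for this service: stay silent.
    origin, hits = seen.most_common(1)[0]
    confidence = hits / sum(seen.values())
    if confidence < min_confidence:
        return None  # Too uncertain to show; the responder triages unaided.
    return origin, confidence


# Hypothetical incident history: (service that alerted, service at fault).
history = [
    ("checkout", "payments"),
    ("checkout", "payments"),
    ("checkout", "payments"),
    ("checkout", "checkout"),
    ("tracking", "geocoder"),
    ("tracking", "tracking"),
]
model = train(history)
checkout_hint = suggest_origin(model, "checkout")
tracking_hint = suggest_origin(model, "tracking")
```

Returning `None` below the threshold keeps low-confidence guesses out of the on-call UI. A returned suggestion is only a hint alongside the alert, and the responder still decides where to look.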
Outcome
Tech
"We stopped paging product engineers for things that weren't theirs. The on-call satisfaction survey moved from 2.4 to 4.1 in two months."