Fintech: 99.99% multi-region uptime
Illustrative composite · no named client · metrics typical of patterns we work in
Live in two regions with automatic failover and a reliability-vs-feature-velocity policy in place. Twelve months of operation, including a cloud-region outage, with no customer-facing downtime.
Problem
A regulated EU fintech ran a single-region active-passive setup that dated from when the company had five engineers. The business had since grown 10×, and the architecture hadn't kept pace.
A 4-hour AWS regional incident the previous quarter had caused a 90-minute customer-facing outage. The board asked for a 99.99% target on the customer-facing API by the next quarter.
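To put the board's target in perspective, here is a minimal sketch of the downtime a 99.99% availability target tolerates. The window lengths (30, 90 and 365 days) and the function name are our own assumptions for illustration, and the calculation treats every minute of outage as a full outage.

```python
def allowed_downtime_minutes(target: float, window_days: float) -> float:
    """Minutes of full outage that an availability target tolerates in a window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return (1.0 - target) * window_days * 24 * 60


TARGET = 0.9999  # the 99.99% target from the brief

# Window lengths are illustrative assumptions, not figures from the engagement.
for label, days in (("30-day month", 30), ("90-day quarter", 90), ("365-day year", 365)):
    minutes = allowed_downtime_minutes(TARGET, days)
    print(f"{label}: about {minutes:.2f} minutes of downtime allowed")
```

Under these assumptions the budget is about 4.32 minutes per 30-day month and about 52.56 minutes per year, so a single 90-minute outage spends roughly 1.7 times a full year's allowance.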
Approach
Added a second region and moved to an active-active setup, using Istio for service-mesh-level traffic management together with Route 53 health checks and per-region circuit breakers.
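The behaviour of a per-region circuit breaker can be sketched in a few lines. This is a plain-Python illustration of the idea, not Istio or Route 53 configuration: the class name, the thresholds, the cooldown and the region names are all hypothetical, and in a mesh this logic would normally live in the proxy layer rather than in application code.

```python
import time
from typing import Callable, Dict, List, Optional


class RegionBreaker:
    """Opens after consecutive failures; lets traffic probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_s = cooldown_s                # hypothetical default
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allows_traffic(self) -> bool:
        if self._opened_at is None:
            return True  # closed: region is considered healthy
        # Half-open: once the cooldown has elapsed, allow traffic to probe the region.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe


def pick_region(preferred: List[str], breakers: Dict[str, RegionBreaker]) -> Optional[str]:
    """Return the first region in preference order whose breaker allows traffic."""
    for region in preferred:
        if breakers[region].allows_traffic():
            return region
    return None  # every region is currently tripped
```

With one breaker per region, a caller that prefers one region falls through to the other as soon as the first breaker opens, and returns to the preferred region only after a probe succeeds.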
Defined customer-facing SLOs (checkout success rate, p99 login latency) and instrumented them through OpenTelemetry. The error budget became a board-visible metric.
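For a success-rate SLO such as checkout, the error budget reduces to simple arithmetic over request counts. The sketch below assumes the counts have already been exported from the metrics pipeline; the class, its field names, the 99.99% target applied to checkout and the request figures are hypothetical and only illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO over one reporting window."""
    name: str
    target: float  # e.g. 0.9999 for a 99.99% success-rate objective
    total: int     # all requests in the window
    good: int      # requests that met the objective

    @property
    def sli(self) -> float:
        """Observed success rate; an empty window counts as fully compliant."""
        return self.good / self.total if self.total else 1.0

    @property
    def allowed_bad(self) -> float:
        """Number of failed requests the target tolerates in this window."""
        return (1.0 - self.target) * self.total

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means it is overspent."""
        bad = self.total - self.good
        if self.allowed_bad == 0:
            return 1.0 if bad == 0 else float("-inf")
        return 1.0 - bad / self.allowed_bad


# Hypothetical numbers: 10 million checkout requests, 600 of them failed.
checkout = SloWindow(name="checkout-success", target=0.9999,
                     total=10_000_000, good=9_999_400)
print(f"{checkout.name}: SLI {checkout.sli:.5f}, "
      f"budget remaining {checkout.budget_remaining:.0%}")
```

In this example the target tolerates about 1,000 failed requests, 600 have been spent, and roughly 40% of the budget remains. A single percentage like that is the kind of figure that can be reported to a board.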
Rolled out incident-response practices (on-call escalation, blameless post-mortems, weekly error-budget reviews) that the team owned by the end of the engagement.
Outcome
Tech
"We hit our 99.99% target three months ahead of the board commitment. The cloud incident in October had us at 100% throughout. That was the moment the board approved the next platform investment."
Related services
The engagement categories this case primarily covered.