Fintech · 2025 · 16 weeks

Fintech: 99.99% multi-region uptime

Illustrative composite · no named client · metrics typical of patterns we work in

Live in two regions with automatic failover and a reliability-vs-feature-velocity policy in place. Twelve months of operation, including a cloud-region outage, with no customer-facing downtime.

Problem

A regulated EU fintech ran a single-region active-passive setup that had been in place since the company had five engineers. The business had grown 10× and the architecture had not kept pace.

A 4-hour AWS regional incident the previous quarter had caused a 90-minute customer-facing outage. The board asked for a 99.99% target on the customer-facing API by the next quarter.
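To put that target in perspective, the short calculation below sketches how much downtime a 99.99% availability target allows. The 90-day quarter and 365-day year are assumptions made for illustration, not figures from the engagement.

```python
def allowed_downtime_minutes(target: float, window_days: float) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target)


# Window lengths are illustrative assumptions.
quarter = allowed_downtime_minutes(0.9999, 90)
year = allowed_downtime_minutes(0.9999, 365)

print(f"per quarter: {quarter:.1f} min, per year: {year:.1f} min")
print(f"a 90-minute outage is {90 / quarter:.1f}x the quarterly allowance")
```

Under these assumptions the allowance is roughly 13 minutes per quarter, so a single 90-minute outage uses about seven quarters' worth of budget.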

Approach

Built an active-active setup across two regions, using Istio for traffic management at the service-mesh level, with Route 53 health checks and per-region circuit breakers.
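The per-region circuit-breaker idea can be sketched as a small state machine. This is a minimal illustration, not the engagement's implementation: in Istio this behaviour is usually configured declaratively (for example, outlier detection on a DestinationRule) rather than hand-written. The class name, thresholds, and region names below are hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Minimal closed / open / half-open breaker for one region (illustrative)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the cooldown has elapsed, then allow trial traffic.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the breaker


def pick_region(breakers: Dict[str, CircuitBreaker],
                preferred: List[str]) -> Optional[str]:
    """Return the first region in preference order whose breaker allows traffic."""
    for region in preferred:
        if breakers[region].allow_request():
            return region
    return None


# Hypothetical region names, for illustration only.
breakers = {
    "eu-west-1": CircuitBreaker(failure_threshold=3),
    "eu-central-1": CircuitBreaker(failure_threshold=3),
}
for _ in range(3):
    breakers["eu-west-1"].record_failure()
print(pick_region(breakers, ["eu-west-1", "eu-central-1"]))
```

After three recorded failures the first region's breaker opens, so the selection falls through to the second region until the cooldown expires and a trial request succeeds.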

Defined customer-facing SLOs (checkout success rate, login latency p99) and instrumented them through OpenTelemetry. Error budget became a board-visible metric.
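As a rough sketch of how an error budget can be derived from a success-rate SLO, the snippet below computes the share of budget consumed in a window. Applying the 99.99% target to a request success rate, and the request counts themselves, are assumptions for illustration; the `SloWindow` name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    target: float          # e.g. 0.9999 means 99.99% of requests must succeed
    total_requests: int
    failed_requests: int

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return self.total_requests * (1.0 - self.target)

    @property
    def budget_consumed(self) -> float:
        """Fraction of the error budget used; above 1.0 means the SLO is breached."""
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / self.allowed_failures


# Illustrative numbers only, not data from the engagement.
window = SloWindow(target=0.9999, total_requests=10_000_000, failed_requests=400)
print(f"budget consumed: {window.budget_consumed:.0%}")
```

A figure like this is what a weekly error-budget review would track: while consumption stays well below 100% there is room to ship features, and as it approaches 100% reliability work takes priority.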

Rolled out incident-response practices (on-call escalation, blameless post-mortems, weekly error-budget reviews) that the team owned by the end of the engagement.

Outcome

99.99% uptime
Reliability

Tech

Kubernetes · Istio · OpenTelemetry · PagerDuty

"We hit our 99.99% target three months ahead of the board commitment. The cloud incident in October had us at 100% throughout. That was the moment the board approved the next platform investment."


