AI startup: GPU Kubernetes platform in 5 weeks
Illustrative composite · no named client · metrics typical of patterns we work in
A GPU-ready production platform on a managed cloud, with automated deployment pipelines for model training, signed releases, and secrets kept out of repositories. Team self-sufficient by week 6.
Problem
An early-stage AI startup had a working AWS account but no production Kubernetes. It needed a cluster with managed GPU nodes to run a TypeScript frontend, a Python inference service, and a training-job runner. The team was four engineers, none of whom had run Kubernetes in production.
No CI/CD pipeline beyond branch builds, no image scanning, no signing, no secrets-management story. Models and weights were promoted by SCP and tribal knowledge.
Approach
Stood up a Kubernetes cluster with Karpenter-provisioned CPU and GPU node pools. Training workloads run on spot GPUs with checkpointing, so an interrupted job resumes instead of restarting; low-latency inference runs on on-demand GPUs.
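To make the checkpointing idea concrete, here is a minimal sketch of a training loop that survives a reclaimed spot node. It is not the platform's actual runner: the class name `CheckpointedJob`, the JSON checkpoint file, and the toy "loss" are all assumptions made for illustration. What it relies on is that Kubernetes sends SIGTERM to a pod's containers before killing them, which gives the job a window to write a final checkpoint.

```python
import json
import os
import signal
import tempfile


class CheckpointedJob:
    """Toy training loop that resumes from its last checkpoint (hypothetical)."""

    def __init__(self, path, total_steps):
        self.path = path
        self.total_steps = total_steps
        self.stop_requested = False
        self.state = self._load()

    def _load(self):
        # Resume from an existing checkpoint, otherwise start at step 0.
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"step": 0, "loss": None}

    def _save(self):
        # Write to a temp file, then rename atomically, so a kill mid-write
        # never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    def request_stop(self, signum=None, frame=None):
        # Usable directly as a signal handler.
        self.stop_requested = True

    def run(self, checkpoint_every=10):
        while self.state["step"] < self.total_steps and not self.stop_requested:
            self.state["step"] += 1
            self.state["loss"] = 1.0 / self.state["step"]  # stand-in for real work
            if self.state["step"] % checkpoint_every == 0:
                self._save()
        self._save()  # final checkpoint, on completion or on SIGTERM
        return self.state["step"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        job = CheckpointedJob(os.path.join(workdir, "ckpt.json"), total_steps=100)
        signal.signal(signal.SIGTERM, job.request_stop)
        print("finished at step", job.run())
```

In a real job the checkpoint would hold model and optimizer state and live on storage that outlasts the node, such as an object store or a shared volume. A local file is used here only to keep the sketch self-contained.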
Wired CI through GitHub Actions with multi-stage container builds, Sigstore signing, and Trivy scanning gates. Secrets sync into the cluster through External Secrets from the cloud secrets manager, never from a repo.
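A scanning gate comes down to one rule: fail the build if the report contains findings at or above a chosen severity. The sketch below shows that rule applied to a Trivy JSON report. The function names and the HIGH/CRITICAL threshold are assumptions, and the field names (`Results`, `Vulnerabilities`, `VulnerabilityID`, `Severity`) follow Trivy's JSON report layout as we understand it, so check them against the Trivy version in use. Trivy can also enforce this itself through its `--severity` and `--exit-code` options; the code only illustrates the logic.

```python
import json

# Assumed policy: these severities block a release.
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report, blocking=BLOCKING):
    """Return (target, vulnerability id, severity) for each blocking finding."""
    findings = []
    # "Results" and "Vulnerabilities" may be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln.get("Severity"))
                )
    return findings


def gate(report_json):
    """Return (exit_code, findings); a non-zero exit code fails the CI step."""
    findings = blocking_findings(json.loads(report_json))
    return (1 if findings else 0), findings


if __name__ == "__main__":
    # Hand-written sample report with placeholder identifiers.
    sample = json.dumps({
        "Results": [
            {
                "Target": "app/requirements.txt",
                "Vulnerabilities": [
                    {"VulnerabilityID": "EXAMPLE-0001", "Severity": "LOW"},
                    {"VulnerabilityID": "EXAMPLE-0002", "Severity": "CRITICAL"},
                ],
            }
        ]
    })
    code, found = gate(sample)
    for target, vuln_id, severity in found:
        print(f"{severity}: {vuln_id} in {target}")
    print("exit code", code)
```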
Built a `pl new` scaffold the team uses to bootstrap new services with the platform's CI defaults already wired.
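The internals of `pl new` are not described here, so the following is only a guess at the general shape of such a scaffold: a set of templates rendered into a new service directory, so every service starts with the same build and CI files. The function `new_service`, the file list, and the template contents are all hypothetical, and the workflow file is a stub, not a working pipeline.

```python
from pathlib import Path
from string import Template

# Hypothetical templates; a real scaffold would carry the platform's full defaults.
TEMPLATES = {
    "Dockerfile": Template(
        "FROM python:3.12-slim\n"
        "WORKDIR /app\n"
        "COPY . .\n"
        'CMD ["python", "-m", "$module"]\n'
    ),
    # Stub only: a real workflow would define build, scan, and sign jobs.
    ".github/workflows/ci.yaml": Template("name: $name-ci\non: [push]\n"),
    "README.md": Template("# $name\n"),
}


def new_service(name, root):
    """Render every template into root/name and return the created paths."""
    module = name.replace("-", "_")  # Python module names cannot contain dashes
    target = Path(root) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    created = []
    for relative_path, template in TEMPLATES.items():
        path = target / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(template.substitute(name=name, module=module))
        created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        for path in new_service("demo-svc", workdir):
            print(path.relative_to(workdir))
```

Refusing to write into an existing directory is a deliberate choice in this sketch: a scaffold that overwrites files silently is an easy way to lose local changes.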
Outcome
Tech
"We deployed ten times the day the platform went live. Two weeks later we'd added two more services and the team hadn't touched a kubectl command."