AI/ML platforms · 2025 · 5 weeks

AI startup: GPU Kubernetes platform in 5 weeks

Illustrative composite · no named client · metrics typical of patterns we work in

A GPU-ready production Kubernetes platform on a managed cloud, with automated deployment pipelines for model training, signed releases, and secrets kept out of source control. The team was self-sufficient by week 6.

Problem

An early-stage AI startup had a working AWS account but no production Kubernetes. It needed a platform with managed GPU nodes to run a TypeScript frontend, a Python inference service, and a training-job runner. The team had four engineers, none of whom had run Kubernetes in production.

There was no CI/CD pipeline beyond branch builds, no image scanning, no signing, and no secrets management. Models and weights were promoted by hand over SCP, following steps that only existed as tribal knowledge.

Approach

Stood up a Kubernetes cluster with Karpenter provisioning a mix of CPU and GPU nodes. Training workloads run on Spot GPUs with checkpointing, so an interrupted job can resume; low-latency inference runs on on-demand GPUs.
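To illustrate why checkpointing makes Spot capacity workable for training, here is a minimal sketch of an interruptible loop. It assumes the job receives SIGTERM when its pod is evicted, which is the standard Kubernetes termination signal. The `CheckpointedTrainer` class, the JSON checkpoint format, and the toy loss are hypothetical stand-ins, not the engagement's actual training code.

```python
import json
import os
import signal
import tempfile


class CheckpointedTrainer:
    """Toy training loop that can be interrupted and later resumed."""

    def __init__(self, ckpt_path, total_steps, save_every=100):
        self.ckpt_path = ckpt_path
        self.total_steps = total_steps
        self.save_every = save_every
        self.stop_requested = False
        self.state = self._load()

    def _load(self):
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                return json.load(f)
        return {"step": 0, "loss": None}

    def _save(self):
        # Write to a temp file and rename it into place, so an interruption
        # mid-write never leaves a truncated checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.ckpt_path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.ckpt_path)

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag; the loop saves and exits cleanly.
        self.stop_requested = True

    def run(self):
        while self.state["step"] < self.total_steps and not self.stop_requested:
            self.state["step"] += 1
            self.state["loss"] = 1.0 / self.state["step"]  # stand-in for real work
            if self.state["step"] % self.save_every == 0:
                self._save()
        self._save()  # always persist the latest state before exiting
        return self.state["step"]


def install_sigterm_handler(trainer):
    # Assumption: the pod is sent SIGTERM before the Spot node is reclaimed.
    signal.signal(signal.SIGTERM, trainer.request_stop)
```

A rescheduled job that points at the same checkpoint path picks up from the last saved step rather than from zero. In practice the checkpoint would live on storage that outlives the node.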

Wired CI through GitHub Actions with multi-stage container builds, Sigstore signing, and Trivy scanning gates. Secrets are synced into the cluster by External Secrets from the cloud's KMS-backed secret store and never live in a repo.
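A scanning gate boils down to "fail the build if the report contains findings above a threshold". The sketch below shows that logic against a report shaped like Trivy's JSON output (`Results`, `Vulnerabilities`, `Severity`). The HIGH/CRITICAL policy and the function names are assumptions for illustration; the source does not state which severities blocked a release.

```python
# Assumed policy: these severities block a release.
BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report, blocking=BLOCKING_SEVERITIES):
    """Collect (target, vulnerability id, severity) for each blocking finding."""
    findings = []
    # "Results" and "Vulnerabilities" may be missing or null in a clean report.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                findings.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln.get("Severity"))
                )
    return findings


def gate(report):
    """Return a process exit code: 0 lets the pipeline continue, 1 fails it."""
    return 1 if blocking_findings(report) else 0


# Hand-written sample report; the identifiers are placeholders, not real CVEs.
sample_report = {
    "Results": [
        {
            "Target": "inference:latest",
            "Vulnerabilities": [
                {"VulnerabilityID": "EXAMPLE-0001", "Severity": "LOW"},
                {"VulnerabilityID": "EXAMPLE-0002", "Severity": "CRITICAL"},
            ],
        },
        {"Target": "requirements.txt", "Vulnerabilities": None},
    ]
}
```

Here `gate(sample_report)` returns 1 because of the single CRITICAL finding. Trivy can enforce a similar policy itself through its `--severity` and `--exit-code` options; a separate script like this is mainly useful when the policy needs extra logic, such as an allowlist.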

Built a `pl new` scaffold the team uses to bootstrap new services with the platform's CI defaults already wired.
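The internals of `pl new` are not described here, so the following is only a guess at the general shape of such a scaffold: render a few template files into a new service directory so that CI defaults exist from the first commit. The `new_service` function, the templates, and the file layout are all hypothetical.

```python
from pathlib import Path

# Hypothetical CI template. Assumes trivy is installed on the runner;
# the signing step is omitted from this sketch.
WORKFLOW_TEMPLATE = """\
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t __NAME__:ci .
      - run: trivy image --exit-code 1 --severity HIGH,CRITICAL __NAME__:ci
"""

# Hypothetical Dockerfile template for a Python service.
DOCKERFILE_TEMPLATE = """\
FROM python:3.12-slim
WORKDIR /app
COPY . .
CMD ["python", "main.py"]
"""

TEMPLATES = {
    ".github/workflows/ci.yml": WORKFLOW_TEMPLATE,
    "Dockerfile": DOCKERFILE_TEMPLATE,
}


def new_service(name, root="."):
    """Create <root>/<name> with default CI and Dockerfile; return created paths."""
    if not name or not name.replace("-", "").isalnum():
        raise ValueError(f"invalid service name: {name!r}")
    service_dir = Path(root) / name
    if service_dir.exists():
        raise FileExistsError(f"{service_dir} already exists")
    created = []
    for relative_path, template in TEMPLATES.items():
        path = service_dir / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(template.replace("__NAME__", name))
        created.append(path)
    return created
```

Calling `new_service("inference-api")` would create an `inference-api/` directory containing a workflow file and a Dockerfile with the service name filled in. Keeping the defaults in one template set means a change to the platform's CI policy reaches every new service.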

Outcome

5 weeks
Time to a production-ready Kubernetes platform

Tech

Kubernetes · Karpenter · Sigstore · Trivy

"We deployed ten times the day the platform went live. Two weeks later we'd added two more services and the team hadn't touched a kubectl command."
