sre, storage, openai, platform-engineering


How we shipped 15 Tbps for OpenAI in 90 days (Session 1 of 3)

OpenAI wanted a second Object Storage instance, in customer-facing production, at 15 Tbps, in three months. Session 1 covers the first week: closing the architecture.

Session 1 of 3: closing the architecture

OpenAI came to us with a specific ask. They wanted a second instance of Object Storage in the same physical region as their primary, optimized for fast reads and writes directly from their GPU fleet. The throughput target was 15 Tbps. They needed it in three months.

15 Tbps works out to roughly 1.9 TB of data every second, a demanding target for any storage system. It also had to land in customer-facing production, alongside an existing instance already serving customer traffic, without disturbing it.
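To put the headline number in more familiar units, here is a minimal back-of-the-envelope sketch. The 15 Tbps target comes from the ask above; the 400 Gbps link speed is purely an illustrative assumption, not a description of the actual network design, and the function names are hypothetical.

```python
import math

TARGET_TBPS = 15.0          # throughput target from the customer ask
ASSUMED_LINK_GBPS = 400.0   # ASSUMPTION: illustrative link speed only


def tbps_to_bytes_per_sec(tbps: float) -> float:
    """Convert terabits per second to bytes per second (decimal units)."""
    return tbps * 1e12 / 8


def links_needed(tbps: float, link_gbps: float) -> int:
    """Minimum number of links at line rate, ignoring overhead and headroom."""
    return math.ceil(tbps * 1000 / link_gbps)


bytes_per_sec = tbps_to_bytes_per_sec(TARGET_TBPS)
tb_per_sec = bytes_per_sec / 1e12
pb_per_hour = bytes_per_sec * 3600 / 1e15
min_links = links_needed(TARGET_TBPS, ASSUMED_LINK_GBPS)

print(f"{tb_per_sec:.3f} TB/s")
print(f"{pb_per_hour:.2f} PB/hour")
print(f"at least {min_links} links at {ASSUMED_LINK_GBPS:.0f} Gbps")
```

The link count is a floor: it ignores protocol overhead, redundancy, and headroom, so a real design would need more than this.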

Two things made the timeline harder than the headline suggested.

First, we had run multiple logical Object Storage instances inside the same region before, but only in pre-production environments. The architecture pattern existed. The operational experience of doing it for a paying customer, with real traffic and real SLAs, did not.

Second, the use case required a special network configuration. Getting 15 Tbps from a customer’s GPU fleet into Object Storage wasn’t something we could pull from an existing pattern.

The major architecture decisions had to happen in a single week. So we pulled the architects into a room, offsite, and stayed there until the network architecture for the use case was closed.

By the end of that week, two things existed that hadn’t before: a one-pager with the plan, milestones, and tasks, and a list of the key teams that owned the delivery. That list covered more than five service teams, each with a named lead.
