Why Expensive GPUs Sit Idle

GPU utilization is a data problem before it is a compute problem, and the three forms of data are how you solve it.

By Jon Hyde | June 17, 2026 | June 15, 2026

Key takeaways | 7 min read

- GPUs don’t underperform. They wait.
- Storage-embedded AI stacks assume data has already landed. Real enterprise estates say it often hasn’t, and sometimes can’t.
- KV cache offload is exactly where the three forms of data become measurable, not philosophical.
- External forces — regulation, application coupling, sovereignty — make “data lands first” an architecturally fragile assumption.
- Federated platforms feed GPUs from data where it lives.

If you’re buying NVIDIA H100s, your biggest risk isn’t picking the wrong GPU. It’s starving the right ones. The most expensive line item in any AI infrastructure is also the one most exposed to a quiet architectural mistake one layer below it: a data platform built to be fed, not to feed.

In the first post of this series, I laid out the structural laws: data has gravity, the real enterprise estate is full of distortions, and “data” is really three forms (data, metadata, vectors), each with different characteristics and requirements. In the second post, I walked through the operational tax of architectures that pretend those laws don’t apply. This post follows the same logic down to the GPU floor, where it stops being a slide and starts burning power.

The visible symptom: GPUs that wait

Every AI leader who has stood in front of a utilization chart knows the moment. Two training clusters. Hundreds of H100s. Average utilization during active jobs sitting in the 30 percent range. Anyone who has built an AI business case knows what that number means: the most expensive line item in the infrastructure is spending most of its active hours idle, waiting for something.

The thing it is waiting for is almost always data: more specifically, data that has not yet finished arriving in the namespace where the AI services can see it.

The bottleneck is rarely compute. It is rarely the interconnect. It is, again and again, the seam between a storage-embedded AI stack and the rest of the data estate. Jobs wait on pipelines. Pipelines wait on source systems. Source systems wait on change windows. Somewhere at the end of that chain, hundreds of thousands of dollars of GPU sit idle and burn power anyway.
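To put a rough number on that idle time, here is a minimal sketch in Python. The GPU count, hourly cost, active hours and utilization are hypothetical placeholders, not figures from any benchmark or price list; substitute your own.

```python
def idle_gpu_cost(gpu_count: int, hourly_cost_per_gpu: float,
                  active_hours: float, utilization: float) -> float:
    """Cost of the GPU-hours spent waiting while jobs are nominally active."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    idle_fraction = 1.0 - utilization
    return gpu_count * hourly_cost_per_gpu * active_hours * idle_fraction


# Hypothetical inputs: 200 GPUs, $3.00 per GPU-hour,
# 500 active hours per month, 35% average utilization.
monthly_idle_cost = idle_gpu_cost(200, 3.00, 500, 0.35)
print(f"Idle spend per month: ${monthly_idle_cost:,.0f}")
```

With these placeholder inputs the idle spend comes to about $195,000 a month, which is the scale the paragraph above describes.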

The architectural assumption that puts them there

Storage-embedded AI stacks are built on an assumption that looks innocuous until it meets a real workload: the AI services will run only on data that has already landed in the platform.

That assumption is fine in a greenfield. It is also fine for a single, tightly scoped workload. It breaks down in exactly the case enterprise AI is becoming: many workloads, many models, many data sources, all changing constantly, with latency-sensitive inference layered on top of training that was already pushing the envelope.²

In that world, “the data has already landed” is a condition you have to keep true. And the external forces I described in the second post are precisely the forces that make it hard to keep true:

- Regulatory and sovereign constraints prevent some data from being copied at all.
- Application coupling ties source systems to change windows that AI workloads do not control.
- Data residency anchors data to a region the namespace doesn’t span.
- Volume and velocity, including telemetry, customer interactions and transactional streams, generate data faster than any sync engine can keep up with.

Every minute the landing isn’t caught up, the GPUs downstream wait. Every time a source system changes faster than the sync can keep up, the freshest data isn’t in the namespace when the job runs. Every time the namespace itself becomes the bottleneck for ingest, for metadata, for retrieval, every consumer of it slows down together.
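The catch-up problem can be sketched as a simple backlog model. This is an illustration only, assuming constant change and sync rates; the function names and the rates used are hypothetical.

```python
def sync_backlog_gb(change_rate_gb_per_hr: float, sync_rate_gb_per_hr: float,
                    hours: float, initial_backlog_gb: float = 0.0) -> float:
    """Data that has changed at the source but not yet landed, after `hours`."""
    net_growth = (change_rate_gb_per_hr - sync_rate_gb_per_hr) * hours
    # A backlog cannot go negative: once caught up, it stays at zero.
    return max(0.0, initial_backlog_gb + net_growth)


def drain_hours(backlog_gb: float, sync_rate_gb_per_hr: float) -> float:
    """Time to land the backlog if the source stopped changing right now."""
    return backlog_gb / sync_rate_gb_per_hr


# Hypothetical rates: the source changes 500 GB/hr, the sync lands 400 GB/hr.
backlog = sync_backlog_gb(500.0, 400.0, hours=24.0)
print(f"Backlog after one day: {backlog:.0f} GB")
print(f"Best-case catch-up time: {drain_hours(backlog, 400.0):.1f} hours")
```

Whenever the change rate exceeds the sync rate, the backlog grows without bound, so a job that requires landed data is always working from a stale copy.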

The GPUs are not underperforming. They are waiting on an architecture that treated data movement as a precondition for doing work.

Where the three forms stop being framing and start being measurable: KV cache

There’s a specific technical wrinkle that makes the argument concrete, because it’s exactly where the difference between a storage-embedded stack and a three-forms architecture becomes a measurable number.

Modern inference, especially for large language models, is increasingly bottlenecked by the KV cache — the key/value tensors a model builds up as it processes context.

- Hold the cache in GPU memory and you’re fast but capacity-constrained.
- Evict and recompute it and you’re paying for the same work twice.
- Offload it to storage and you suddenly care a great deal about how fast, how parallel and how close that storage is to your GPUs.
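The trade-off in the three options above can be made concrete with a back-of-the-envelope sketch. The model shape, storage bandwidth and prefill rate below are illustrative assumptions for a grouped-query-attention model, not measurements and not the published configuration of any specific model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    layers: int           # transformer layers
    kv_heads: int         # key/value heads per layer
    head_dim: int         # dimension of each head
    bytes_per_value: int  # 2 for FP16 or BF16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key tensor and one value tensor per layer, hence the factor of 2.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value


def kv_cache_bytes(m: ModelShape, tokens: int) -> int:
    return kv_bytes_per_token(m) * tokens


def reload_seconds(cache_bytes: int, storage_bytes_per_s: float) -> float:
    """Time to read an offloaded cache back from storage."""
    return cache_bytes / storage_bytes_per_s


def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild an evicted cache by re-running prefill."""
    return tokens / prefill_tokens_per_s


# Hypothetical values throughout.
shape = ModelShape(layers=64, kv_heads=8, head_dim=128, bytes_per_value=2)
context_tokens = 32_768
size = kv_cache_bytes(shape, context_tokens)

print(f"KV cache per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")
print(f"KV cache for the context: {size / 2**30:.1f} GiB")
print(f"Reload at 20 GB/s: {reload_seconds(size, 20e9):.2f} s")
print(f"Recompute at 5,000 tokens/s: {recompute_seconds(context_tokens, 5000.0):.2f} s")
```

Under these assumptions a single long context occupies several gigabytes, which is why holding it in GPU memory runs out of capacity, and why the reload path only beats recompute when storage is fast and close.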

KV cache, in the Physics of Enterprise AI sense, is exactly the kind of thing that should not be treated as ordinary data. It’s a vector: a structured representation of meaning that needs to travel between GPUs and storage at high frequency. An architecture that treats KV cache as ordinary data, ingesting it through the same path as a customer’s source data, fights its own structure.

Dell published head-to-head testing in October 2025 on exactly this workload, using the Qwen3-32B model.¹ The results, drawn from Dell’s internal testing and a public VAST disclosure:

- PowerScale: 0.82-second Time to First Token (TTFT)
- ObjectScale: 0.86-second TTFT
- VAST: 1.5-second TTFT
- Standard vLLM without KV cache offloading: 11.8-second TTFT

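In other words, Dell’s testing reports up to 19x faster TTFT versus baseline vLLM. On the four figures listed here, the gain works out to roughly 14x over standard vLLM without KV cache offload (11.8 seconds versus 0.82 to 0.86 seconds) and about 1.7x to 1.8x over the VAST result (1.5 seconds).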
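The ratios implied by the four TTFT figures listed above can be checked directly. This short sketch uses only those published numbers; the dictionary keys and the `speedup` helper are just labels for illustration.

```python
# TTFT in seconds, as listed above.
ttft_seconds = {
    "PowerScale": 0.82,
    "ObjectScale": 0.86,
    "VAST": 1.5,
    "vLLM without KV cache offload": 11.8,
}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate reaches first token than the baseline."""
    return baseline_s / candidate_s


baseline = ttft_seconds["vLLM without KV cache offload"]
for name, seconds in ttft_seconds.items():
    print(f"{name}: {speedup(baseline, seconds):.1f}x versus vLLM without offload")

vast = ttft_seconds["VAST"]
print(f"PowerScale versus VAST: {speedup(vast, ttft_seconds['PowerScale']):.2f}x")
```

The same arithmetic applies to any TTFT a vendor publishes, provided the model, context length and baseline are the same.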

Both architectures accelerated inference. But an open, GPU-optimized storage foundation accelerated it further, with lower query response times and better GPU utilization as a consequence. The numbers will move as everyone optimizes. The direction is what matters: an architecture built to feed GPUs, treating the heavy data as gravity-bound and the meaning-bearing vectors as portable, compounds the advantage with every inference request. An architecture where feeding the GPU depends on data first arriving in a vendor namespace is structurally a step behind.³

Since that head-to-head, VAST has continued to publish on KV cache offload. Most recently, a December 2025 result with NVIDIA Dynamo and CoreWeave reported a roughly 20x TTFT improvement over recompute and a 90% gain in GPU efficiency. That result used a different workload and baseline, however, and did not include a Qwen3-32B head-to-head that improves on the 1.5-second TTFT above.⁵

The RFP question that separates a storage product from an AI data platform

The original generation of AI infrastructure RFPs asked about IOPS, bandwidth and capacity. Those are the questions you ask of a storage product. They are the wrong questions to ask of an AI data platform.

The distinguishing question — the one most enterprises don’t ask until eighteen months in — is: What will my GPU utilization look like on this platform under a realistic inference workload, with KV cache offload, against a data estate that includes regulated, sovereign and application-coupled sources?

That question forces a vendor to confront the structural argument. It forces them to answer for what happens when data can’t land in their namespace, not just what happens when it has. It forces them to publish reproducible TTFT, tokens-per-second and cache hit rates on a current open-source model — methodology and all. And it forces an honest answer to the only economic question that matters: how much GPU idle time is your architecture budgeting for?
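For a sense of what reproducible numbers look like, here is a minimal sketch of how the three metrics named above could be derived from per-request logs. The `RequestRecord` fields and the metric definitions are assumptions made for illustration; a real benchmark should state its own definitions in its methodology.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestRecord:
    sent_at: float              # seconds, when the request was issued
    first_token_at: float       # seconds, when the first token arrived
    finished_at: float          # seconds, when the last token arrived
    prompt_tokens: int
    cached_prompt_tokens: int   # prompt tokens served from the KV cache
    output_tokens: int


def summarize(records: List[RequestRecord]) -> Dict[str, float]:
    if not records:
        raise ValueError("need at least one request record")
    ttfts = [r.first_token_at - r.sent_at for r in records]
    total_latency = sum(r.finished_at - r.sent_at for r in records)
    total_output = sum(r.output_tokens for r in records)
    total_prompt = sum(r.prompt_tokens for r in records)
    total_cached = sum(r.cached_prompt_tokens for r in records)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        # Per-request throughput: output tokens over summed request latency.
        "output_tokens_per_s": total_output / total_latency,
        # Token-level hit rate: share of prompt tokens found in the cache.
        "cache_hit_rate": total_cached / total_prompt,
    }


# Two made-up requests, purely to exercise the function.
sample = [
    RequestRecord(0.0, 0.8, 4.8, 1000, 900, 200),
    RequestRecord(10.0, 11.2, 15.2, 1000, 500, 200),
]
for metric, value in summarize(sample).items():
    print(f"{metric}: {value:.2f}")
```

Publishing the raw records alongside the summary is what lets a competitor, or a customer, reproduce the result.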

The federated answer

Dell AI Data Platform is designed around the opposite assumption from a storage-embedded stack. Rather than requiring data to arrive in a vendor namespace before GPUs can use it, the platform is built to feed GPUs from data where it already lives — across PowerScale, ObjectScale, third-party storage, warehouses and cloud — with KV cache offload, high-bandwidth parallelism and inference acceleration engineered into the stack from the ground up.⁴

That works because it operates on the three forms of data the way they actually behave:

- The heavy data stays where it lives: governed, sovereign, secure, where regulation, application coupling and ownership require it to stay.
- The metadata propagates so the GPU layer can see and select against the entire estate without first ingesting it.
- The vectors, including KV cache as a special case, travel at the speeds and proximities GPUs require, because the architecture treats them as first-class citizens, not as a special kind of file.
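As a purely conceptual illustration of the metadata point, the sketch below selects datasets from a catalog without touching the underlying bytes. The `CatalogEntry` type, its fields and the sample entries are hypothetical and do not represent the Dell AI Data Platform API or any product interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    location: str        # where the heavy data lives and stays
    size_tb: float
    copy_allowed: bool   # False for regulated or sovereign data
    tags: frozenset


def select_by_tag(catalog: List[CatalogEntry], tag: str) -> List[CatalogEntry]:
    """Select datasets using metadata only; no data is read or moved."""
    return [entry for entry in catalog if tag in entry.tags]


# Hypothetical estate: two sources that cannot be copied, one that can.
catalog = [
    CatalogEntry("claims-2025", "on-prem-eu/filer-a", 120.0, False,
                 frozenset({"claims", "regulated"})),
    CatalogEntry("support-chats", "cloud-us/bucket-b", 8.5, True,
                 frozenset({"support"})),
    CatalogEntry("claims-archive", "on-prem-eu/object-c", 400.0, False,
                 frozenset({"claims"})),
]

for entry in select_by_tag(catalog, "claims"):
    print(f"{entry.name}: read in place at {entry.location} ({entry.size_tb} TB)")
```

The selection step costs the same whether a source is 8 TB or 400 TB, and it works for sources that cannot be copied at all, because only the metadata has to be visible to the GPU layer.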

The result is exactly what the dashboard wants to see: GPUs that spend more of their active hours doing work, and fewer of them waiting on the plumbing.

Three questions to ask before you sign

These go straight into your RFP.

1. What is my expected GPU utilization under a realistic inference workload with KV cache offload on your platform, against a data estate that includes my regulated, sovereign and application-coupled sources? If the vendor can only answer in IOPS and bandwidth, you’re buying a storage product, not an AI data platform.
2. Will you publish a current head-to-head on a current open-source model, and stand behind the configuration when a competitor reproduces it?
3. When a new data source comes online, including one that simply cannot be copied, how long until my GPUs can train or infer on it? The honest answer is a direct measure of how much GPU idle time your architecture is budgeting for.

The shorter version: you didn’t buy expensive GPUs to watch them wait on constraints your architecture refused to acknowledge.

What’s next

The utilization number hides a second problem that only shows up when you try to grow. The next constraint isn’t compute. It isn’t even storage. It’s the building itself, and the way an architecture that was already fighting gravity scales the fight upward into power, cooling and floorspace. That’s the next post.

Data has gravity. AI creates more of it. The real enterprise estate is full of distortions. Data exists in three forms: data, metadata, vectors, each with different requirements. The architecture that gets all four right is the one that turns idle GPUs into working ones — because it was built for the estate enterprises actually operate in, not the one a reference design assumes.

¹Dell Technologies, “Dell Storage Engines: Accelerating AI inferencing with PowerScale and ObjectScale,” October 2025. Dell testing used the Qwen3-32B model. VAST results drawn from VAST Data, “Accelerating Inference,” July 2025.

²NAND Research, “How to Think about VAST Data,” February 2026.

³Prowess Consulting, commissioned by Dell, “Architectural and Operational Comparison: Dell AI Data Platform vs. VAST AI OS,” April 2026.

⁴Dell Technologies, “Dell AI Data Platform with NVIDIA Supercharges Enterprise AI with Breakthrough Data Orchestration and Storage Innovations,” PR Newswire, March 2026.

⁵VAST Data, “NVIDIA Dynamo + VAST = Scalable, Optimized Inference,” December 2025.