Key Takeaways:
- Dell’s scalable KV Cache offloading solution, powered by PowerScale and ObjectScale, delivers up to 19x faster Time to First Token (TTFT) versus standard vLLM, enabling higher inference performance and lower query response times.
- By offloading the KV Cache to high-performance storage, Dell’s solution frees up GPU resources, overcoming memory bottlenecks and improving efficiency. Benchmark tests show Dell’s storage engines outperform competitors like VAST, delivering faster acceleration and better performance.
- Beyond inference, Dell’s AI Data Platform (AIDP) simplifies the entire AI data lifecycle—from raw data to knowledge creation—empowering organizations to operationalize AI at scale.
Large Language Models (LLMs) are transforming business operations, from enhancing customer interactions to accelerating content creation. As these AI models become more powerful, their computational demands grow, creating a significant challenge: performance bottlenecks that can stall progress and inflate costs. Many organizations believe the only answer is to add more expensive, power-hungry GPUs.
There is a smarter, more cost-effective path forward. The key lies in optimizing a process called Key-Value (KV) Caching. By offloading the KV Cache from GPU memory to high-performance storage and retrieving it instead of recomputing it, you can dramatically improve performance, reduce latency, and unlock a greater return on your AI investment.
This blog introduces Dell’s industry-leading KV Cache offloading solution. We’ll show you how our scalable, high-performance Storage Engines empower you to deploy large-scale LLMs with greater efficiency and at a lower cost.
The high cost of inefficient AI
To understand the solution, we first need to look at the problem. In LLM inference, the KV Cache stores intermediate attention calculations (the key and value tensors for tokens already processed), so the model doesn’t have to recompute the same information for every new token it generates. This speeds up response times, measured as Time to First Token (TTFT).
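To make that concrete, here is a simplified sketch of why the cache exists: each decoding step projects only the newest token and reuses the cached keys and values for everything processed so far. This is plain PyTorch with toy dimensions and a single attention head, purely for illustration.

```python
import torch

# Illustrative only: a single-head, single-sequence decode loop showing why a
# KV Cache exists. Shapes and sizes are simplified, not those of a real LLM.
d_model = 64
Wq = torch.randn(d_model, d_model)
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

k_cache = []  # keys for all previously processed tokens
v_cache = []  # values for all previously processed tokens

def decode_step(x_t):
    """Attend the newest token against all cached keys/values.

    Without the cache, every step would recompute K and V for the entire
    prefix; with it, each step only projects the new token.
    """
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)   # compute K/V for the new token only
    v_cache.append(x_t @ Wv)
    K = torch.stack(k_cache)   # (seq_len, d_model)
    V = torch.stack(v_cache)
    scores = (q @ K.T) / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ V

for _ in range(8):  # generate 8 tokens
    out = decode_step(torch.randn(d_model))
```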
However, as models grow and user demand increases, the KV Cache can consume gigabytes, or even terabytes, of memory (see the sizing sketch after this list). This quickly overwhelms your GPU’s memory capacity, leading to two undesirable outcomes:
- Performance Degradation: Your system slows down, creating a poor user experience.
- Failed Workloads: The memory capacity bottleneck causes processes to fail entirely.
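How large can the cache get? A back-of-envelope sizing sketch makes the point. The model geometry and concurrency figures below are illustrative assumptions for a 70B-class model served in FP16, not measurements from the benchmark.

```python
# Rough KV Cache sizing: 2 (K and V) * layers * KV heads * head dim
# * bytes per element * tokens. All figures are illustrative assumptions.
layers, kv_heads, head_dim = 80, 8, 128   # assumed 70B-class model geometry
bytes_per_elem = 2                        # FP16
context_tokens = 131_072                  # full 131K context window
concurrent_requests = 16                  # assumed load

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
per_request_gb = per_token * context_tokens / 1024**3
total_gb = per_request_gb * concurrent_requests

print(f"~{per_token / 1024:.0f} KiB per token, "
      f"~{per_request_gb:.1f} GiB per 131K-token request, "
      f"~{total_gb:.0f} GiB for {concurrent_requests} concurrent requests")
```

Under these assumptions, a single full-context request needs roughly 40 GiB of KV Cache, and a modest amount of concurrency pushes the total into the hundreds of gigabytes, well beyond the memory of any single GPU.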
The conventional solution is to add more compute and memory resources. This approach creates a cycle of escalating costs, increased power consumption, and greater infrastructure complexity. It addresses the symptom, not the root cause, and ultimately diminishes your AI ROI.
A breakthrough in AI efficiency: storage as the solution
Dell offers a transformative alternative: offload the KV Cache to our high-performance AI storage engines. Instead of relying solely on scaling compute, you can enhance your GPU investment by leveraging cost-effective, scalable storage—delivering a more efficient and sustainable AI infrastructure.
Our validated software stack features the vLLM inference engine, LMCache with an integrated Dell connector, and the NIXL data transfer library from NVIDIA Dynamo, extended by Dell with an S3-over-RDMA plugin for ObjectScale. This stack integrates seamlessly with Dell’s storage portfolio – PowerScale and ObjectScale. With support for both file and object storage, you can choose the best-fit solution for your environment and workload.
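To show how the pieces fit together, here is a minimal sketch of serving a model with vLLM while routing KV Cache blocks through an LMCache connector, using the open-source Python APIs. The class and argument names follow recent vLLM/LMCache releases and may differ in your version; the model path is only an example, and pointing LMCache at a specific PowerScale or ObjectScale backend is done through LMCache’s own configuration (for example, a config file referenced by the LMCACHE_CONFIG_FILE environment variable), which is not shown here.

```python
# Minimal sketch: serving with vLLM while handing KV Cache blocks to LMCache.
# Names follow recent open-source releases and may differ by version; the
# storage backend itself is configured separately through LMCache (not shown).
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",   # example model, as in the benchmark
    tensor_parallel_size=4,                      # TP=4, matching the test setup
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",       # route KV blocks through LMCache
        kv_role="kv_both",                       # store on prefill, retrieve on reuse
    ),
)

print(llm.generate(["Reused long prompt ..."], SamplingParams(max_tokens=64)))
```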
This strategy allows you to extend your KV Cache capacity far beyond the limits of GPU memory. It’s a game-changing solution for organizations looking to scale LLM inference efficiently, economically, and sustainably.
Performance that speaks for itself
We benchmarked our vLLM + LMCache + NVIDIA NIXL stack to measure Time to First Token (TTFT) performance with a 100% KV Cache hit rate. This scenario isolates the efficiency of retrieving a fully prepopulated KV Cache from external storage. The tests were run on 4x NVIDIA H100 GPUs across Dell storage backends.
We compared our storage offloading solution to a baseline where the KV Cache is computed from scratch on the GPU, using the LLaMA-3.3-70B Instruct model with Tensor Parallelism = 4 (Figure 1).
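For context, TTFT is typically measured on the client side as the time between submitting a request and receiving the first streamed token. The snippet below shows one way to measure it against an OpenAI-compatible vLLM endpoint; the URL, model name, and prompt are placeholders, and this is not the harness used to produce the published numbers.

```python
# Illustrative TTFT measurement against an OpenAI-compatible vLLM server.
# Endpoint, model name, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "long prompt with a cached prefix ..."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start   # elapsed time until the first token
        print(f"TTFT: {ttft:.2f} s")
        break
```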

The results were remarkable:
- Dell storage engines—PowerScale and ObjectScale—delivered a 1-second TTFT at a full context window of 131K tokens.
- This represents a 19x improvement over the standard vLLM configuration, which took over 17 seconds at the same context size.
- Even at a small context size of 1K tokens, every Dell storage engine tested retrieved the KV Cache from storage faster than it could be recomputed on the GPU, highlighting the low-latency efficiency of our solution.
This significant TTFT reduction minimizes overhead, enables faster token generation, and ultimately improves GPU utilization and overall inference throughput.
How Dell’s solution compares to competition
We also ran TTFT testing using the Qwen3-32B model to compare Dell’s solution head-to-head against a competitor’s solution. The results showed PowerScale (0.82 sec) and ObjectScale (0.86 sec) both outperforming VAST (1.5 sec) in TTFT, while delivering up to 14x acceleration over standard vLLM without KV Cache offloading (11.8 sec)¹.
While VAST’s solution also demonstrates acceleration, Dell’s storage engines deliver higher acceleration and lower query response times¹.
Shaping the future of AI infrastructure
KV Cache offloading is more than a significant performance optimization for today’s workloads; it is a foundational capability for the future of AI. These results demonstrate that a KV Cache store-and-retrieve solution built on Dell storage engines enables organizations to achieve superior inference performance. For a deeper technical dive into our methodology and test configurations, please see our full technical blog.
With Dell’s AI Data Platform (AIDP), we go beyond just inference acceleration. AIDP addresses the entire data lifecycle for AI, simplifying the customer journey from raw data to transformed data, to knowledge creation, and finally to accelerating reasoning models and agentic applications.
Dell’s modular and open solution supports every stage of the AI lifecycle—from data ingestion to model training, inference, and deployment. By combining Dell’s high-performance storage engines, like PowerScale and ObjectScale, with advanced data engines and seamless integration with NVIDIA, the Dell AI Data Platform empowers organizations to operationalize AI at scale.
Ready to turn your data into results? Let’s schedule a workshop to explore how Dell’s AI Data Platform can support your specific use case.
1 Dell results are based on internal testing using the Qwen3-32B model. VAST results are based on testing information disclosed in the following VAST online source: Accelerating Inference – VAST Data. Oct. 2025.


