Dell Storage Engines: Accelerating AI inferencing with PowerScale and ObjectScale

Dell’s KV Cache offloading solution delivers up to 19x faster Time to First Token than a standard vLLM configuration, supporting large-scale LLMs with greater efficiency.

Key Takeaways:

  • Dell’s scalable KV Cache offloading solution, powered by PowerScale and ObjectScale, delivers up to 19x faster Time to First Token (TTFT) versus standard vLLM, enabling higher inference performance and lower query response times.
  • Dell’s solution offloads the KV Cache to high-performance storage, freeing up GPU resources, overcoming memory bottlenecks, and improving efficiency. Benchmark tests show Dell’s storage engines outperform competitors like VAST, delivering greater acceleration and better performance.
  • Beyond inference, Dell’s AI Data Platform (AIDP) simplifies the entire AI data lifecycle—from raw data to knowledge creation—empowering organizations to operationalize AI at scale.

Large Language Models (LLMs) are transforming business operations, from enhancing customer interactions to accelerating content creation. As these AI models become more powerful, their computational demands grow, creating a significant challenge: performance bottlenecks that can stall progress and inflate costs. Many organizations believe the only answer is to add more expensive, power-hungry GPUs.

There is a smarter, more cost-effective path forward. The key lies in optimizing a process called Key-Value (KV) Caching. By offloading the KV Cache from GPU memory to high-performance storage and retrieving it instead of recomputing it, you can dramatically improve performance, reduce latency, and unlock a greater return on your AI investment.

This blog introduces Dell’s industry-leading KV Cache offloading solution. We’ll show you how our scalable, high-performance Storage Engines empower you to deploy large-scale LLMs with greater efficiency and at a lower cost.

The high cost of inefficient AI

To understand the solution, we first need to look at the problem. In LLM inference, the KV Cache stores intermediate attention results (keys and values), so the model doesn’t have to recompute the same information for every new token it generates. This speeds up response times, measured as Time to First Token (TTFT).
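
To make the mechanism concrete, the toy Python loop below shows a single attention head reusing cached keys and values during decoding. It is a minimal sketch for illustration only, not Dell’s or vLLM’s implementation, and the shapes are deliberately simplified.

```python
# Minimal sketch of why a KV Cache helps: each decode step appends the new
# token's key/value to a cache and reuses everything already computed,
# instead of redoing that work for the entire prefix.
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64                               # toy head dimension
k_cache, v_cache = [], []            # grows by one entry per token
for step in range(8):                # toy decode loop
    x = np.random.randn(d)           # hidden state of the newest token
    k_cache.append(x)                # real models project with W_k / W_v here
    v_cache.append(x)
    out = attend(x, np.stack(k_cache), np.stack(v_cache))
# Without the cache, every step would redo this work for the whole prefix;
# redoing it for a long prompt is exactly the prefill cost that drives TTFT.
```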

However, as models grow and user demand increases, the KV Cache can consume gigabytes, or even terabytes, of memory. This quickly overwhelms your GPU’s memory capacity, leading to two undesirable outcomes:

  1. Performance Degradation: Your system slows down, creating a poor user experience.
  2. Failed Workloads: The memory capacity bottleneck causes processes to fail entirely.
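
To put that memory pressure in perspective, here is a back-of-the-envelope sizing calculation. The geometry below assumes a Llama-3.3-70B-class model (80 transformer layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 precision); treat the exact figures as illustrative.

```python
# Rough KV Cache sizing for one long-context sequence (assumed model geometry).
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # keys + values
print(per_token / 1024)             # ≈ 320 KiB of KV Cache per token

context_len = 131_072               # a full 131K-token context window
per_sequence = per_token * context_len
print(per_sequence / 1024**3)       # ≈ 40 GiB for a single full-context sequence
# A few concurrent long-context requests can exceed the HBM of even an 80 GB GPU,
# which is exactly the capacity bottleneck described above.
```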

The conventional solution is to add more compute and memory resources. This approach creates a cycle of escalating costs, increased power consumption, and greater infrastructure complexity. It addresses the symptom, not the root cause, and ultimately diminishes your AI ROI.

A breakthrough in AI efficiency: storage as the solution

Dell offers a transformative alternative: offload the KV Cache to our high-performance AI storage engines. Instead of relying solely on scaling compute, you can enhance your GPU investment by leveraging cost-effective, scalable storage—delivering a more efficient and sustainable AI infrastructure.

Our validated software stack features the vLLM inference engine, LMCache with an integrated Dell connector, and the NIXL data transfer library from NVIDIA Dynamo, extended by Dell with an S3-over-RDMA plugin for ObjectScale. This stack integrates seamlessly with Dell’s storage portfolio – PowerScale and ObjectScale. With support for both file and object storage, you can choose the best-fit solution for your environment and workload.
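
As an illustration of how such a stack is typically wired together, the sketch below enables LMCache’s vLLM connector and points it at an external KV store. The connector name, environment variables, and endpoint are assumptions drawn from public LMCache examples, not the Dell-validated configuration, so treat it as a starting point rather than a reference deployment.

```python
# Hypothetical sketch: enabling KV Cache offloading in a vLLM + LMCache setup.
# Names follow public LMCache examples and may differ from the Dell-validated
# stack; the storage endpoint below is a placeholder.
import os
from vllm import LLM
from vllm.config import KVTransferConfig

os.environ["LMCACHE_CHUNK_SIZE"] = "256"     # tokens per cached KV chunk (assumed)
os.environ["LMCACHE_LOCAL_CPU"] = "False"    # bypass the host-RAM tier
os.environ["LMCACHE_REMOTE_URL"] = "<storage-endpoint>"  # e.g., a PowerScale/ObjectScale target

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,                  # matches the benchmark setup below
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",   # LMCache's connector for vLLM v1
        kv_role="kv_both",                   # store newly computed KV and retrieve hits
    ),
)
```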

This strategy allows you to extend your KV Cache capacity far beyond the limits of GPU memory. It’s a game-changing solution for organizations looking to scale LLM inference efficiently, economically, and sustainably.
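
Conceptually, the offload pattern reduces to a lookup before compute: check the external store for the prompt’s KV blocks, and only run prefill when there is a miss. The helper names in the sketch below (kv_store, model) are hypothetical stand-ins for the storage connector and inference engine.

```python
# Conceptual retrieve-or-compute pattern behind KV Cache offloading
# (hypothetical interfaces, not Dell's or LMCache's actual API).
import hashlib

def prefix_key(prompt_tokens: list) -> str:
    # Cached KV blocks are commonly addressed by a hash of the token prefix,
    # so identical or shared prompt prefixes map to the same entries.
    return hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()

def get_kv(prompt_tokens, kv_store, model):
    key = prefix_key(prompt_tokens)
    cached = kv_store.get(key)          # fast path: read KV blocks from storage
    if cached is not None:
        return cached                   # prefill compute is skipped entirely
    kv = model.prefill(prompt_tokens)   # slow path: compute on the GPU once...
    kv_store.put(key, kv)               # ...then persist for future requests
    return kv
```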

Performance that speaks for itself

We benchmarked our vLLM + LMCache + NVIDIA NIXL stack to measure Time to First Token (TTFT) performance with a 100% KV Cache hit rate. This scenario isolates the efficiency of retrieving a fully prepopulated KV Cache from external storage. The tests were run on 4x NVIDIA H100 GPUs across Dell storage backends.

We compared our storage offloading solution to a baseline where the KV Cache is computed from scratch on the GPU, using the Llama-3.3-70B Instruct model with Tensor Parallelism=4 (Figure 1).

Figure 1. Time to First Token: Dell storage engines versus the standard vLLM baseline (Llama-3.3-70B Instruct, Tensor Parallelism=4).

The results were remarkable:

  • Dell storage engines—PowerScale and ObjectScale—delivered a 1-second TTFT at a full context window of 131K tokens.
  • This represents a 19x improvement over the standard vLLM configuration, which took over 17 seconds at the same context size.
  • Even at a small context size of 1K tokens, every Dell storage engine configuration retrieved the KV Cache from storage faster than the GPU could recompute it, highlighting the low-latency efficiency of our solution.

This significant TTFT reduction minimizes overhead, enables faster token generation, and ultimately improves GPU utilization and overall inference throughput.

How Dell’s solution compares to competition

We also ran TTFT testing using the Qwen3-32B model to compare Dell’s solution head-to-head against a competitor’s solution. The results showed PowerScale (0.82 sec) and ObjectScale (0.86 sec) both outperforming VAST (1.5 sec) in TTFT, while delivering up to 14x acceleration over standard vLLM without KV Cache offloading (11.8 sec)¹.

While VAST’s solution also demonstrates acceleration, Dell’s storage engines deliver greater acceleration and lower query response times¹.

Shaping the future of AI infrastructure

KV Cache offloading is more than a performance optimization for today’s workloads; it is a foundational capability for the future of AI. These results demonstrate that a KV Cache store-and-retrieve solution built on Dell storage engines enables organizations to achieve superior inference performance. For a deeper technical dive into our methodology and test configurations, please see our full technical blog.

With Dell’s AI Data Platform (AIDP), we go beyond just inference acceleration. AIDP addresses the entire data lifecycle for AI, simplifying the customer journey from raw data to transformed data, to knowledge creation, and finally to accelerating reasoning models and agentic applications.

Dell’s modular and open solution supports every stage of the AI lifecycle—from data ingestion to model training, inference, and deployment. By combining Dell’s high-performance storage engines, like PowerScale and ObjectScale, with advanced data engines and seamless integration with NVIDIA, the Dell AI Data Platform empowers organizations to operationalize AI at scale.

Ready to turn your data into results? Let’s schedule a workshop to explore how Dell’s AI Data Platform can support your specific use case.


1 Dell results are based on internal testing using the Qwen3-32B model. VAST results are based on testing information disclosed in the following VAST online source: Accelerating Inference – VAST Data. Oct. 2025.

About the Author: David Noy

David Noy is a 25-year veteran of the storage and data management industry with deep, hands-on expertise in data center infrastructure, enterprise and cloud data storage, and solutions for artificial intelligence. After more than a decade directing engineering organizations, followed by leadership of high-impact product management and technical marketing teams, he has shaped flagship portfolios at Dell Technologies, NetApp, Veritas, Cohesity, and VAST Data. He has been the global executive leader for enterprise product lines recognized by Gartner as #1 in their category.

As Vice President of Product Management for Unstructured Data Solutions at Dell Technologies, David oversees the end-to-end strategy for enterprise, high performance computing, and artificial intelligence workloads, including responsibility for the Dell AI Data Platform with its data engines and storage engines. His remit spans product conception, roadmap execution, and go-to-market alignment, delivering infrastructure that not only scales but also integrates advanced data management, cyber resilience, and hybrid cloud capabilities into a single, coherent platform.

Industry context:

  • Explosive growth of unstructured data: AI, edge telemetry, and rich media are driving compound annual growth of more than 25%, demanding file/object architectures that scale linearly and economically.
  • Hybrid and multi-cloud deployments: Enterprises now treat cloud as an operating model, not a destination; seamless data mobility and consistent policy enforcement are table stakes.
  • AI and GPU acceleration: Modern AI pipelines require parallel file and object stores that can saturate the latest high-speed networks while guaranteeing metadata efficiency.
  • Cyber resilience and compliance: Immutable snapshots, object lock, and zero-trust architectures have become mandatory in the face of ransomware and evolving data sovereignty laws.

David’s track record of shipping innovative, enterprise-grade solutions at global scale directly aligns with these trends, positioning him to lead the next wave of file and object innovation that accelerates customers’ digital transformation and AI ambitions on premises, at the edge, and in the cloud.