
Why is AI so memory hungry?

Understanding KV Cache: The secret memory vault that makes LLMs fast.

Understanding Self-Attention

Transformer models use self-attention to understand relationships between words. Each word looks at all previous words to understand context.

Self-Attention Mechanism

(Interactive diagram: the sentence "The cat sat", with the current token "The" attending to previous tokens; at the first position, "The" reads from itself.)
šŸ’” What the lines mean: Each line shows the current token "attending to" (reading information from) previous tokens. The thicker blue line shows self-attention (the token looking at itself).
Q (Query): What am I looking for?
K (Key): What do I represent?
V (Value): What information do I carry?

How KV Cache Helps

āŒ Without KV Cache

For each new word, the model must recalculate K and V for ALL previous words, so generating an N-token sequence costs O(N²).

āœ… With KV Cache

Store the previously computed K and V values and compute them only for the new word, so generation costs O(N).

The Key Insight

Since previous tokens don't change, their K and V values remain constant. We can cache them and reuse them for every new token!
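To make this concrete, here is a minimal NumPy sketch (toy embeddings and a random projection matrix, purely illustrative) showing that the keys computed for a prefix are identical to the first rows of the keys computed for the longer sequence, which is exactly what makes caching safe:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 8

# Toy token embeddings for "The cat sat" and a fixed key projection (illustrative values only).
X = rng.normal(size=(3, d_model))          # one row per token
W_k = rng.normal(size=(d_model, d_k))

K_prefix = X[:2] @ W_k                      # keys computed when only "The cat" had been seen
K_full   = X @ W_k                          # keys computed after "sat" arrives

# The first two rows match exactly: earlier tokens' K (and V) never change,
# so they can be cached and reused instead of recomputed.
print("Cached keys still valid:", np.allclose(K_full[:2], K_prefix))
```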

Self-Attention Formula

Attention(Q, K, V) = softmax(Q Ɨ Kįµ€ / √dā‚–) Ɨ V

Q (Query): the current token's question
K (Key): all tokens' identities ← CACHED
V (Value): all tokens' information ← CACHED

By caching K and V, we avoid recomputing them for every new token.
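As a rough illustration, here is a small NumPy sketch of the formula with K and V kept in a growing cache, so each decode step projects only the newest token (the dimensions and random weights are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 16
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))   # cached keys for all previous tokens
V_cache = np.empty((0, d_k))   # cached values for all previous tokens

def attend(x_new):
    """One decode step: project only the new token, append it to the cache, attend over everything."""
    global K_cache, V_cache
    q = x_new @ W_q                                   # Query for the current token
    K_cache = np.vstack([K_cache, x_new @ W_k])       # Key for the new token only   <- cached
    V_cache = np.vstack([V_cache, x_new @ W_v])       # Value for the new token only <- cached
    scores = q @ K_cache.T / np.sqrt(d_k)             # Q Kįµ€ / √dā‚–
    weights = np.exp(scores - scores.max())           # softmax
    weights /= weights.sum()
    return weights @ V_cache                          # Ɨ V

for x in rng.normal(size=(5, d_model)):               # five toy tokens
    out = attend(x)
print("context vector shape:", out.shape)             # (16,)
```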

How LLMs Write

Imagine you are writing a story. To write the next word, you need to remember everything that happened before.

Large Language Models (LLMs) work the same way. They predict one word (token) at a time.

Input: "The quick brown fox"
Prediction: "jumps"

The → quick → brown → fox → jumps

The model looks at all previous words to guess the next one.

The Problem: No Cache

Without a cache, the model must re-compute the entire sentence from the beginning to generate just one new word.


The Solution: KV Cache

With KV Cache, we reuse the computation already done for previous words and only need to calculate the new one, saving time, energy, and cost.
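As a practical sketch, here is the same idea using the Hugging Face transformers library (a GPT-2-style model is assumed, and return types can vary by version): the cache produced while processing the prompt is passed back so each later step only processes the newest token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, use_cache=True)            # prefill: builds the KV cache for the prompt
    cache = out.past_key_values

    for _ in range(5):                          # decode: one token at a time
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        # Only the new token is passed in; K and V for earlier tokens come from the cache.
        out = model(next_id, past_key_values=cache, use_cache=True)
        cache = out.past_key_values

print(tok.decode(ids[0]))
```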


The Speed Advantage

Without KV Cache, the model has to re-compute everything for every new word.
With KV Cache, it only computes the new word.

(Interactive chart: time per token vs. context length from 0 to 128k tokens, with and without KV Cache, plus the relative speedup at the selected context length; the speedup from caching grows as the context gets longer.)

Why it matters

Without KV Cache, the cost of generating each new token grows linearly with the context length, so the total cost of a sequence grows quadratically. For long contexts (e.g., 128k tokens), this makes generation prohibitively slow.
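A quick back-of-the-envelope comparison (token counts chosen for illustration) of how many per-token K/V projections are computed with and without a cache when generating N tokens:

```python
# Without a cache, step t re-projects all t tokens; with a cache, it projects only 1.
for n in (2_048, 32_768, 131_072):
    without = n * (n + 1) // 2          # 1 + 2 + ... + n projections
    with_cache = n                      # one projection per new token
    print(f"N={n:>7,}: {without:>16,} vs {with_cache:>7,}  (~{without // with_cache:,}x more work)")
```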

The Cost: Memory Usage

KV Cache makes generation fast, but it consumes a lot of GPU memory. Newer models use techniques like Grouped Query Attention (GQA) and Multi-head Latent Attention (MLA) to reduce this cost.

Example configuration (optimized with GQA): 32 layers, hidden dim 4096, 8 KV heads, 2,048-token context, batch size 1.
Estimated KV Cache Size: 0.2500 GB
Formula: 2 Ɨ Layers Ɨ KV_Heads Ɨ Head_Dim Ɨ Context Ɨ Batch Ɨ 2 Bytes
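The formula as a small helper (the head dimension is assumed to be hidden dim divided by total attention heads, i.e. 4096 / 32 = 128), reproducing the 0.25 GB figure above:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x context x batch x bytes (fp16/bf16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_value

size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=4096 // 32, context=2_048, batch=1)
print(f"{size / 2**30:.4f} GB")   # 0.2500 GB
```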

Optimization Impact

Comparing GQA optimization vs unoptimized (Standard) architecture

Unoptimized (Standard): 1.00 GB (K/V stored for all 32 attention heads)
Optimized (GQA): 0.25 GB (K/V stored for only 8 KV heads)
Memory Saved: 75.0% (0.75 GB saved)

Why this matters: Grouped Query Attention (GQA) shares key-value pairs across multiple query heads, reducing memory usage without sacrificing performance.

NVIDIA Dynamo

Enterprise-Scale LLM Inference

NVIDIA Dynamo optimizes LLM serving across distributed systems with intelligent KV cache management

Prefill-Decode Disaggregation

PREFILL (compute-intensive): process the prompt and generate the initial KV cache.
DECODE (memory-bound): use the KV cache to generate tokens one by one.
šŸ’” Why separate?

Different phases have different resource needs. Separating them maximizes GPU utilization and throughput.
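A conceptual sketch of the two roles (not the Dynamo API; the function names and the PrefillResult container are invented for illustration): a prefill worker turns the prompt into a KV cache, and a decode worker consumes that cache token by token.

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache: object      # opaque handle to the prompt's K/V tensors
    first_logits: object  # logits used to pick the first generated token

def prefill_worker(model, prompt_ids):
    """Compute-intensive: one big forward pass over the whole prompt builds the KV cache."""
    out = model(prompt_ids, use_cache=True)
    return PrefillResult(kv_cache=out.past_key_values, first_logits=out.logits)

def decode_worker(model, next_token, result, max_new_tokens=32):
    """Memory-bound: reuse the transferred KV cache and generate tokens one by one."""
    cache, tokens = result.kv_cache, [next_token]
    for _ in range(max_new_tokens):
        out = model(tokens[-1], past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        tokens.append(out.logits[:, -1, :].argmax(dim=-1, keepdim=True))
    return tokens
```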

KV Block Manager (KVBM)

GPU Memory: fastest (active cache)
CPU Memory: fast
Local SSD: medium
Network Storage: slower
šŸ’” Multi-Tier Offloading:

KV cache overflows to slower tiers when GPU memory is full, enabling larger context windows.
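A conceptual sketch of the offloading policy (not KVBM's actual interface; tier names and capacities are made up): place each KV block in the fastest tier that still has room, spilling downward when a tier fills up.

```python
# Tiers ordered fastest -> slowest, with illustrative capacities in KV blocks.
TIERS = [("GPU memory", 1_000), ("CPU memory", 10_000),
         ("Local SSD", 100_000), ("Network storage", 10_000_000)]

usage = {name: 0 for name, _ in TIERS}

def place_block(block_id):
    """Put a KV cache block in the fastest tier that still has free capacity."""
    for name, capacity in TIERS:
        if usage[name] < capacity:
            usage[name] += 1
            return name
    raise RuntimeError("all tiers full")

for blk in range(1_200):
    place_block(blk)
print(usage)   # the first 1,000 blocks stay on GPU; the overflow spills to CPU memory
```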

NIXL: Fast Data Transfer

(Diagram: a KV cache block moves from GPU 1 to GPU 2 via ultra-fast NVLink/RDMA transfer.)
Low latency: non-blocking transfers
High bandwidth: parallel transfers
šŸ’” NIXL enables:

Seamless KV cache sharing between prefill and decode engines across different nodes.

KV Cache-Aware Routing

An incoming request (new query) is scored against each worker's cache:
Worker A: 20% cache hit
Worker B: 75% cache hit āœ“ selected
Worker C: 45% cache hit
šŸ’” Smart routing:

Routes requests to workers with highest KV cache hit rate, reducing redundant computation.
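A minimal sketch of the routing decision (the hit-rate estimates come from the example above; how they are measured is not shown here): send the request to the worker with the highest estimated KV cache hit rate.

```python
# Estimated fraction of the incoming prompt already present in each worker's KV cache.
workers = {"Worker A": 0.20, "Worker B": 0.75, "Worker C": 0.45}

def route(hit_rates):
    """Pick the worker that can reuse the most cached prefix, minimizing recomputation."""
    return max(hit_rates, key=hit_rates.get)

print(route(workers))   # Worker B
```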

How It All Works Together

1. Disaggregate

Separate prefill (compute) and decode (memory) phases for optimal resource use

2. Manage Cache

KVBM stores KV cache across GPU, CPU, SSD, and network tiers

3. Transfer Fast

NIXL moves KV cache between nodes with ultra-low latency

4. Route Smart

Direct requests to workers with best cache hit rates

Result: Enterprise-Scale Performance

Higher throughput, lower latency, and efficient resource utilization for serving LLMs at scale

Agentic AI

KV Cache for Agent Swarms

In Agentic AI, context is a continuous stream from sensors and cameras, not just a 2000-token prompt

Drone Swarm: Agent-to-Agent Handoff

Without Shared KV Cache

Drone A tracks the vehicle → battery runs low → it sends only basic info to Drone B → Drone B starts over, wasting time re-acquiring the target. Drone A's "memory" is lost.

(Interactive demo: Drone A tracks the target and builds up its KV cache, then hands off to Drone B.)

With Shared KV Cache

Zero Context Loss

Agents instantly inherit full context from previous agents, maintaining continuous awareness

Persistent Memory

KV cache becomes long-term memory for the entire agent swarm, stored on VAST

Instant Handoff

No time wasted re-acquiring targets or rebuilding context from scratch

Collective Intelligence Through Shared Memory

By transforming KV cache into persistent, shareable "long-term memory," agent swarms achieve true collective intelligence. Each agent builds upon the knowledge of all previous agents.
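A conceptual sketch of the handoff (the path, helper names, and use of torch.save are assumptions, not a VAST or Dynamo API): the outgoing agent persists its KV cache to shared storage, and the incoming agent resumes from it instead of rebuilding context.

```python
import torch

SHARED_PATH = "/mnt/shared/kv_cache/target_042.pt"   # hypothetical shared-storage location

def hand_off(kv_cache):
    """Drone A: persist the tracking context before going offline."""
    torch.save(kv_cache, SHARED_PATH)

def take_over(model, next_observation_ids):
    """Drone B: load Drone A's KV cache and continue decoding with full context."""
    kv_cache = torch.load(SHARED_PATH)
    return model(next_observation_ids, past_key_values=kv_cache, use_cache=True)
```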

šŸ”„ Seamless Handoffs
⚔ Real-time Coordination
🧠 Shared Intelligence