RouteKV Compiler: Smarter KV Cache for LLM Inference

Imagine you're running a massive LLM in production: think 128K context windows, thousands of concurrent users, all hammering your GPU cluster. At some point, you notice something weird: your GPUs aren't compute-bound. They're sitting there waiting. Waiting for memory. Specifically, they're waiting for the Key-Value (KV) cache to shuffle between different memory tiers.

This is the real bottleneck that nobody talks about enough. And it's exactly the problem that RouteKV Compiler is designed to solve.

This blog is a deep dive into RouteKV: what it is, why it exists, how it works under the hood, and how you can use it. We'll connect the dots from first principles all the way to actual code. Let's go.

Part 1: What is the KV Cache?

Before we can appreciate what RouteKV does, we need to build up from the atom: the smallest, most fundamental unit of the problem.

In a Transformer model, every token you generate requires attending over all previous tokens. This is how models "remember" what was said earlier in a conversation. During this attention computation, the model calculates Keys (K) and Values (V) for every token at every layer. These tensors are expensive to recompute, so modern serving systems cache them. This is your KV cache.

Let's make this concrete:

# Simplified single-step attention with a KV cache
import math
import torch
import torch.nn.functional as F

def attention_with_kv_cache(query, key, value, kv_cache):
    # query, key, value: [num_heads, 1, head_dim] for the newest token
    head_dim = query.shape[-1]

    # Append new K, V to cache
    kv_cache['keys'].append(key)
    kv_cache['values'].append(value)

    # Attend over ALL cached keys and values
    all_keys = torch.cat(kv_cache['keys'], dim=1)      # [num_heads, seq_len, head_dim]
    all_values = torch.cat(kv_cache['values'], dim=1)

    # Standard scaled dot-product attention
    scores = torch.matmul(query, all_keys.transpose(-1, -2)) / math.sqrt(head_dim)  # [num_heads, 1, seq_len]
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, all_values)           # [num_heads, 1, head_dim]

So the KV cache is essentially a running memory of the conversation. Every single token in every single layer gets an entry. For a model with 32 layers, 32 attention heads, and a 4096 hidden dimension (128 per head) running on a 128K context window, the KV cache alone can consume tens of gigabytes of GPU memory per request.
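To see where "tens of gigabytes" comes from, here is a back-of-the-envelope calculation. It assumes fp16 storage, standard multi-head attention, and a 4096 hidden size split across 32 heads (so 128 per head); the function name is just for illustration.

```python
# Back-of-the-envelope KV cache size for a single request.
# Assumptions: fp16 (2 bytes per element) and standard multi-head attention,
# i.e. one K and one V vector per head, per layer, per token.

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len

size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=128 * 1024)
print(f"{size / 2**30:.0f} GiB")  # prints "64 GiB"
```

That is 0.5 MiB per token, so a single 128K-token request would nearly fill an 80GB GPU before any model weights are loaded. Models that use grouped-query attention store fewer KV heads and come in lower.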

This is where things start to hurt.

Part 2: The Memory Hierarchy — Why One Tier is Never Enough

GPU memory (HBM — High Bandwidth Memory) is fast. Blazing fast. But it's also incredibly expensive and limited. A high-end A100 has 80GB. An H100 SXM gives you 80GB too. When you're running multiple requests simultaneously with long contexts, the KV cache fills up fast.

In a typical GPU server, the memory hierarchy runs from GPU HBM (fastest, smallest) through CPU DRAM, reached over PCIe, down to disk (slowest, largest).

The brutal truth: HBM is where you want your KV cache for speed, but you physically can't fit everything there. So systems start offloading older or less-accessed blocks to DRAM or even disk. The moment that happens, attention computation has to stall or overlap with data movement across PCIe.
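A rough timing model makes the stall-versus-overlap difference concrete. The bandwidth constants below are nominal peak figures used as assumptions (roughly A100-class HBM and PCIe 4.0 x16); sustained numbers on real hardware are lower, and the per-layer compute time is a made-up input.

```python
# Nominal peak bandwidths, used as assumptions for a rough estimate.
HBM_GBPS = 2000.0   # roughly A100 80GB class HBM
PCIE_GBPS = 32.0    # roughly PCIe 4.0 x16, one direction

def transfer_ms(num_bytes, gbps):
    """Time to move num_bytes at the given bandwidth, in milliseconds."""
    return num_bytes / (gbps * 1e9) * 1e3

def layer_times_ms(compute_ms, fetch_ms, num_layers):
    """Total time if every layer waits for its own fetch, versus fetching
    layer i+1 while layer i computes."""
    stalled = num_layers * (compute_ms + fetch_ms)
    # With overlap: the first fetch is exposed, each middle step costs the
    # slower of compute and fetch, and the last compute is exposed.
    overlapped = fetch_ms + (num_layers - 1) * max(compute_ms, fetch_ms) + compute_ms
    return stalled, overlapped

per_layer_kv = 2 * 2**30  # 2 GiB: one layer's share of a 64 GiB KV cache
print(f"{transfer_ms(per_layer_kv, PCIE_GBPS):.1f} ms over PCIe")  # 67.1 ms
print(f"{transfer_ms(per_layer_kv, HBM_GBPS):.1f} ms from HBM")    # 1.1 ms

stalled, overlapped = layer_times_ms(compute_ms=10, fetch_ms=4, num_layers=32)
print(stalled, overlapped)  # prints "448 324"
```

When the fetch is shorter than the compute, overlap hides almost all of it. When the fetch is longer, the transfer sets the pace no matter how well it is scheduled, which is why what gets offloaded matters as much as when.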

Part 3: How Existing Systems Fail (And Why That's OK, Sort Of)

Before RouteKV, the community built a lot of great systems. Let's be fair to them — they moved the needle massively. But each one punted on at least one critical dimension:

vLLM — The gold standard for LLM serving. Uses paged attention to manage KV blocks like virtual memory in an OS. When memory pressure hits, it evicts blocks using Least Recently Used (LRU). The problem? LRU is workload-agnostic. It doesn't know which tokens are actually important to upcoming attention computations. It just evicts what hasn't been touched recently.
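To make "workload-agnostic" concrete, here is a minimal LRU block cache. It is a generic sketch of the eviction policy, not vLLM's actual code; the class and block names are invented for the example.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Fixed-capacity block cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> payload, oldest first

    def access(self, block_id, payload=None):
        """Touch a block. Returns the evicted block_id, or None."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return None
        self.blocks[block_id] = payload
        if len(self.blocks) > self.capacity:
            evicted, _ = self.blocks.popitem(last=False)  # drop the oldest
            return evicted
        return None

cache = LRUBlockCache(capacity=2)
cache.access("system_prompt")  # likely to be attended to again, but touched first
cache.access("turn_1")
evicted = cache.access("turn_2")
print(evicted)  # prints "system_prompt"
```

The policy only sees recency. It has no way to know that the system prompt block may receive more attention weight later than either of the turns that pushed it out.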

SGLang RadixAttention — Brilliant for prefix sharing. Multiple requests that share a common system prompt only compute (and store) the KV for that prefix once. But it doesn't do per-layer tiering — a layer-5 KV block is treated the same as a layer-30 block even though their attention patterns and importance profiles can be radically different.

ScoutAttention — Hides PCIe latency by layer-ahead prefetching: while layer N computes, it prefetches KV for layer N+1 from CPU. Smart! But it uses fixed heuristics. It can't adapt to the actual workload characteristics at runtime.

KVTC / LAVa / MixKV — These compress or evict KV at a single stage without coordinating with downstream routing decisions. Compressing KV on GPU is useless if the routing logic doesn't know the KV is now in a compressed format requiring a different kernel.

The root cause of all these gaps? Each system treats one dimension in isolation:

WHERE to place KV
HOW to represent/compress it
WHEN to move it
WHICH requests should share it

None of them answer all four questions together. That's the gap RouteKV fills.

Part 4: Enter RouteKV - A Compiler for KV Memory

RouteKV Compiler is a research prototype that reframes KV cache management as a compilation problem. Just like how a compiler transforms source code into optimized machine code by analyzing the program holistically, RouteKV analyzes your LLM workload holistically and produces an optimized KV placement plan.

The central insight: decisions about where KV lives, how it's encoded, when it moves, and who shares it are all deeply interdependent. You can't optimize one without reasoning about the others. RouteKV makes these decisions jointly, at runtime, driven by learned cost models.

The goal? Up to 2-4x throughput improvement and 3-5x GPU memory reduction for long-context inference with less than 3% quality degradation.

Let's break down how it achieves this.

Part 5: The Architecture - A Tree of Components

RouteKV is structured as a three-stage pipeline feeding into a unified runtime engine. Let's walk through each node in the tree.

Node 1: The Profiler (Offline, One-Time)

The profiler runs offline before you deploy. It instruments a model during inference on a representative dataset and collects:

Attention patterns per layer
KV access frequencies
Cache hit/miss statistics
Token importance distributions

This data becomes the training signal for the cost model.

python -m routekv.profiler.collect \
  --model meta-llama/Llama-3.2-1B \
  --dataset longbench \
  --output traces/llama_1b_traces.pkl

Think of this as the "analyze" phase of a traditional compiler. The profiler tells RouteKV what the access patterns look like, which layers have hot KV blocks, and which blocks are barely ever touched.
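As a rough picture of what "collecting traces" can mean, the sketch below aggregates per-layer access counts and mean attention weight per block. The event tuple format and the function name are assumptions made for illustration; they are not the layout of the project's .pkl traces.

```python
from collections import Counter, defaultdict

def summarize_traces(events):
    """events: iterable of (layer, block_id, attention_weight) tuples.
    Returns per-layer access counts and mean attention weight per block."""
    access_counts = defaultdict(Counter)
    weight_sums = defaultdict(float)
    for layer, block_id, weight in events:
        access_counts[layer][block_id] += 1
        weight_sums[(layer, block_id)] += weight
    mean_weight = {
        key: total / access_counts[key[0]][key[1]]
        for key, total in weight_sums.items()
    }
    return access_counts, mean_weight

# Hypothetical events: layer 0 touches block b0 twice with high weight.
events = [(0, "b0", 0.6), (0, "b0", 0.4), (0, "b1", 0.05), (1, "b0", 0.1)]
counts, means = summarize_traces(events)
print(counts[0].most_common(1))  # prints [('b0', 2)]
print(means[(0, "b0")])          # prints 0.5
```

Summaries like these (which blocks are hot in which layer, and how much weight they carry) are the kind of signal a cost model could be trained on.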

Node 2: The Cost Model (Learned, Per-Workload)

This is where RouteKV gets smart. Rather than hardcoded heuristics, it trains a small neural network (the cost model) on the profiling traces. This model learns to predict, for a given KV block (identified by layer, position, request context), which memory tier it should live in.

python -m routekv.cost_model.train \
  --traces traces/llama_1b_traces.pkl \
  --output models/cost_model.pt

The cost model is analogous to a compiler's optimizer. It doesn't just apply fixed rules; it has learned the cost tradeoffs specific to your model and workload.

Node 3: The Tier Plan Compiler (Online / JIT)

At inference time, the tier plan compiler runs JIT (Just-In-Time). For each incoming request, it:

Reads the request metadata and current memory state
Queries the cost model for tier assignments
Emits a tiering plan: a per-layer, per-block schedule of where KVs should live

This is the "codegen" phase — the output is a concrete execution plan.
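One way to picture the emitted plan is as a mapping from (layer, block) to a tier, filled under per-tier budgets. The sketch below is a simplified stand-in: the names, the greedy fill, and the placeholder scoring function are all assumptions, and the real compiler queries a learned cost model rather than this toy function.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HBM = "hbm"
    DRAM = "dram"
    TRANSFORMED = "transformed"

@dataclass(frozen=True)
class BlockKey:
    layer: int
    block: int

def compile_tier_plan(scores, hbm_budget, dram_budget):
    """scores: dict[BlockKey, float], predicted benefit of keeping a block hot.
    Greedily fills HBM, then DRAM, in descending score order. Everything
    else goes to the transformed (compressed) store."""
    plan = {}
    ranked = sorted(scores, key=scores.get, reverse=True)
    for rank, key in enumerate(ranked):
        if rank < hbm_budget:
            plan[key] = Tier.HBM
        elif rank < hbm_budget + dram_budget:
            plan[key] = Tier.DRAM
        else:
            plan[key] = Tier.TRANSFORMED
    return plan

def stand_in_cost_model(layer, block, num_blocks):
    # Placeholder for the learned model: simply favors more recent blocks.
    return (block + 1) / num_blocks

num_layers, num_blocks = 2, 4
scores = {
    BlockKey(layer, block): stand_in_cost_model(layer, block, num_blocks)
    for layer in range(num_layers)
    for block in range(num_blocks)
}
plan = compile_tier_plan(scores, hbm_budget=2, dram_budget=4)
print(plan[BlockKey(0, 3)], plan[BlockKey(0, 0)])  # Tier.HBM Tier.TRANSFORMED
```

A greedy fill is the simplest thing that respects the budgets. A joint optimizer would also weigh transfer cost and sharing, which this sketch ignores.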

Node 4: The Runtime Engine

The runtime engine executes the tier plan. It manages four memory pools:

HBM Pool (hot): Critical KV blocks that will be accessed imminently. Kept on GPU.
CPU DRAM Pool (warm): Blocks likely to be needed soon. On CPU, transferred asynchronously.
Transformed Store (cold): Compressed or quantized KV blocks. Cheaper to store, more expensive to decode.
Shared KV Store (reuse): KV blocks shared across multiple requests (like common prefixes). Computed once, reused everywhere.
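The shared pool is the least obvious of the four, so here is a small sketch of prefix sharing: blocks are keyed by a hash of all tokens up to and including the block, so two requests with the same prefix resolve to the same entries. The class, its methods, and the string KV payloads are hypothetical stand-ins, not RouteKV's API.

```python
import hashlib

class SharedPrefixStore:
    """Maps a chain of token blocks to shared KV entries with refcounts."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.entries = {}  # chain_hash -> {"kv": payload, "refs": int}

    def _chain_hashes(self, token_ids):
        # Each hash covers every token up to the end of its block, so equal
        # hashes imply an equal prefix (up to hash collisions).
        h = hashlib.sha256()
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            block = token_ids[start:start + self.block_size]
            h.update(",".join(map(str, block)).encode() + b";")
            yield h.copy().hexdigest()

    def acquire(self, token_ids, compute_kv):
        """Returns (kv_blocks, num_reused). compute_kv(block_index) is only
        called for blocks that are not already in the store."""
        kv_blocks, reused = [], 0
        for i, key in enumerate(self._chain_hashes(token_ids)):
            entry = self.entries.get(key)
            if entry is None:
                entry = self.entries[key] = {"kv": compute_kv(i), "refs": 0}
            else:
                reused += 1
            entry["refs"] += 1
            kv_blocks.append(entry["kv"])
        return kv_blocks, reused

store = SharedPrefixStore(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
_, reused_a = store.acquire(system_prompt + [9, 10, 11, 12], lambda i: f"kv{i}")
_, reused_b = store.acquire(system_prompt + [20, 21, 22, 23], lambda i: f"kv{i}")
print(reused_a, reused_b)  # prints "0 2"
```

The second request reuses both system-prompt blocks and only computes its own final block. The refcounts are what a runtime would consult before evicting or compressing a shared entry.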

Node 5: The Attention Kernel Dispatcher

The dispatcher is the low-level execution engine. It runs custom CUDA and Triton kernels that are aware of tier placement:

For HBM-resident KV: standard fast attention
For mixed HBM+DRAM KV: GPU-CPU co-attention with async PCIe transfers
Block-wise sparse scoring to skip blocks with near-zero attention weights
Asynchronous prefetch streams to overlap compute with data movement
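The dispatch decision can be sketched in a few lines: drop blocks whose estimated attention mass is negligible, then pick a kernel based on which tiers the remaining blocks live in. The kernel names, the dict layout, and the threshold are illustrative assumptions, not the project's actual kernels or defaults.

```python
def plan_attention(blocks, skip_threshold=0.01):
    """blocks: list of dicts with 'id', 'tier', and 'est_mass' (estimated
    share of attention weight the block will receive, in [0, 1]).
    Returns (kernel_name, kept_block_ids, prefetch_ids)."""
    # Block-wise sparse scoring: skip blocks that would contribute ~nothing.
    kept = [b for b in blocks if b["est_mass"] >= skip_threshold]
    # Anything kept but not on the GPU has to be fetched asynchronously.
    prefetch = [b["id"] for b in kept if b["tier"] != "hbm"]
    tiers = {b["tier"] for b in kept}
    if tiers <= {"hbm"}:
        kernel = "hbm_attention"      # hypothetical name: all KV is GPU-resident
    elif "transformed" in tiers:
        kernel = "dequant_attention"  # hypothetical name: needs a decode step
    else:
        kernel = "co_attention"       # hypothetical name: mixed HBM + DRAM
    return kernel, [b["id"] for b in kept], prefetch

blocks = [
    {"id": 0, "tier": "hbm", "est_mass": 0.70},
    {"id": 1, "tier": "dram", "est_mass": 0.295},
    {"id": 2, "tier": "transformed", "est_mass": 0.005},
]
print(plan_attention(blocks))  # prints ('co_attention', [0, 1], [1])
```

Note that skipping the compressed block also avoids the decode kernel entirely, which is the kind of interaction between sparsity and tiering the dispatcher has to be aware of.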

Part 6: Getting Started - Installation & Setup

RouteKV is a Python package with CUDA dependencies. Here's what you need:

Requirements:

Python >= 3.10
PyTorch >= 2.2
CUDA >= 11.8
Triton >= 2.1
transformers, accelerate

Install:

git clone https://github.com/harsha-mangena/routekv-compiler
cd routekv-compiler
pip install -e ".[dev]"

The [dev] extra pulls in the development and experiment dependencies, including benchmarking utilities and Jupyter notebook support.

Repository structure after install:

routekv-compiler/
├── routekv/              # The core library
│   ├── profiler/         # Offline trace collection
│   ├── cost_model/       # Learned tier assignment model
│   ├── compiler/         # JIT tier plan compiler
│   ├── runtime/          # Memory pool management
│   ├── kernels/          # CUDA/Triton kernels
│   ├── scoring/          # KV importance scoring
│   └── policies/         # Tiering policies
├── configs/              # YAML experiment configs
├── experiments/          # Benchmark scripts
├── notebooks/            # Colab-ready research notebooks
├── tests/                # Unit + integration tests
└── docs/                 # Architecture diagrams

Part 7: End-to-End Walkthrough - Three Real Scenarios

Let's connect everything with three concrete usage patterns that map to real production situations.

Scenario 1: You have a new model and want to baseline it

You just downloaded Llama-3.2-1B and want to see how RouteKV improves it on long-context benchmarks.

Step 1: Profile the model

python -m routekv.profiler.collect \
  --model meta-llama/Llama-3.2-1B \
  --dataset longbench \
  --output traces/llama_1b_traces.pkl

This runs the model on LongBench (a long-context benchmark dataset) and records KV access traces. You get a .pkl file with per-layer, per-position attention statistics.

Step 2: Train the cost model

python -m routekv.cost_model.train \
  --traces traces/llama_1b_traces.pkl \
  --output models/cost_model.pt

The trainer takes those traces and fits a small model that predicts tier assignments. This typically runs in minutes on a CPU.

Step 3: Run the benchmark

# Baseline: vanilla LRU (no RouteKV)
python experiments/benchmark_tiering.py \
  --config configs/llama_1b_baseline.yaml

# RouteKV with 3-tier hierarchy
python experiments/benchmark_tiering.py \
  --config configs/llama_1b_tier3.yaml

You'll get a report comparing tokens/sec, GPU memory usage, and quality metrics side by side.

Scenario 2: Plugging RouteKV into your application

You want to use RouteKV as a drop-in wrapper for inference in your application:

# conceptual API - subject to change as the project evolves
from routekv.runtime import RouteKVEngine

# Initialize the engine with your trained cost model
engine = RouteKVEngine.from_pretrained(
    base_model="meta-llama/Llama-3.2-1B",
    cost_model_path="models/cost_model.pt",
    tier_config_path="configs/llama_1b_tier3.yaml",
)

# Generate as normal - RouteKV handles KV placement transparently
prompt = "Analyze this 50,000 word document and summarize the key themes..."
output = engine.generate(prompt, max_new_tokens=512)
print(output)

Under the hood, for each token generated:

RouteKV queries the tier plan compiler to decide where each layer's KV should live
The runtime engine moves blocks as needed (asynchronously on PCIe streams)
The kernel dispatcher picks the right attention kernel for the current KV layout
You get tokens back, with a design target of 2-4x the throughput of vanilla inference

Scenario 3: Integrating with an existing serving stack

If you have a production setup with your own scheduler (like a custom vLLM wrapper), RouteKV can operate as a KV backend:

# conceptual API - subject to change
from routekv.runtime import RouteKVSession
from my_serving_stack import RequestBatch

# Initialize a stateful RouteKV session
rk_session = RouteKVSession.load(
    model_name="meta-llama/Llama-3.2-1B",
    cost_model_path="models/cost_model.pt"
)

# Your serving loop
while True:
    batch: RequestBatch = scheduler.next_batch()

    # RouteKV handles KV placement - your scheduler owns everything else
    logits, new_kv_state = rk_session.step(
        input_ids=batch.input_ids,
        kv_state=batch.kv_state,
        metadata=batch.metadata,  # conversation_id, route hints, etc.
    )

    scheduler.return_step_results(batch.request_ids, logits, new_kv_state)

This is the most powerful integration pattern. Your scheduler still controls batching, admission, and SLAs. RouteKV purely owns the KV placement decisions, acting like a specialized memory controller sitting below your serving logic.

Part 8: The Scoring Module - How Does RouteKV Know What's Important?

One of the most clever pieces of RouteKV is the scoring module. It implements KV importance and diversity scoring to decide which blocks are truly "hot" vs which ones can safely be evicted or compressed.

The key question: which KV blocks matter most for upcoming attention?

Two signals answer this:

Importance Score: For each KV block, how large are the attention weights that point to it? Blocks with high aggregate attention weights across many queries are important. Blocks that are barely attended to are candidates for eviction or compression.

Diversity Score: This is more subtle. Even if a block has moderate importance, it might be the only block representing a specific semantic concept. Evicting it would cause an irrecoverable quality drop. The diversity score penalizes eviction of blocks that are "unique" in the representation space.

These two scores combine to give a tiering priority for each block:

# Conceptual illustration of the scoring logic.
# attention_stats and compute_pairwise_diversity stand in for RouteKV internals;
# the weights and thresholds below are illustrative placeholders, not project defaults.
ALPHA, BETA = 0.7, 0.3
HOT_THRESHOLD, WARM_THRESHOLD, COLD_THRESHOLD = 0.6, 0.3, 0.1

def compute_tier_priority(kv_block, attention_stats):
    # How much attention weight flows through this block
    importance = attention_stats.mean_weight_to_block(kv_block)

    # How different is this block from others in the cache
    diversity = compute_pairwise_diversity(kv_block, attention_stats.all_blocks)

    # Combined priority (higher = keep on HBM)
    return ALPHA * importance + BETA * diversity

def assign_tier(priority):
    # Tier assignment based on priority thresholds
    if priority > HOT_THRESHOLD:
        return "hbm"          # Keep on GPU
    elif priority > WARM_THRESHOLD:
        return "dram"         # Offload to CPU
    elif priority > COLD_THRESHOLD:
        return "transformed"  # Compress it
    return "evict"            # Drop entirely

This is the "brain" of RouteKV. Without this scoring, you'd fall back to LRU heuristics. With it, you get workload-aware placement that targets a 2-4x efficiency gain.
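For readers who want something they can run, here is one possible concrete reading of the two signals. Importance is the mean attention weight a block receives from recent queries, and diversity is taken as one minus the cosine similarity to the most similar other block. Both definitions, the summary vectors, and the 0.7/0.3 weights are assumptions for illustration; the project may define them differently.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tier_priorities(attn_weights, block_keys, alpha=0.7, beta=0.3):
    """attn_weights[q][b]: attention weight from recent query q to block b.
    block_keys[b]: a summary vector for block b (for example its mean key)."""
    num_blocks = len(block_keys)
    priorities = []
    for b in range(num_blocks):
        importance = sum(row[b] for row in attn_weights) / len(attn_weights)
        # Diversity: 1 minus similarity to the most similar other block.
        others = [cosine(block_keys[b], block_keys[o])
                  for o in range(num_blocks) if o != b]
        diversity = 1.0 - max(others) if others else 1.0
        priorities.append(alpha * importance + beta * diversity)
    return priorities

# Blocks 0 and 1 are near-duplicates; block 2 is rarely attended but unique.
attn_weights = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
block_keys = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
p = tier_priorities(attn_weights, block_keys)
print(sorted(range(3), key=lambda b: p[b], reverse=True))  # prints [0, 2, 1]
```

Block 2 has the lowest importance but outranks block 1, because block 1 is almost redundant with block 0 while block 2 is the only one of its kind. That is the behavior the diversity term is meant to produce.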

Part 9: Performance Targets - What Are We Actually Shooting For?

RouteKV is a research prototype and the team has set clear stretch goals to guide design. These aren't guaranteed benchmarks (the README makes that explicit), but they represent the design ambition:

Metric                      Baseline (LRU-style)   RouteKV (target)
Tokens/sec (long context)   1x                     2-4x
GPU memory footprint        1x                     3-5x reduction
LongBench quality           100%                   >= 97%
Cold-tier accuracy drop     N/A                    < 2%

The target is radical: reduce GPU memory pressure by 3-5x while maintaining near-identical output quality and more than doubling throughput. This would fundamentally change the economics of serving long-context LLMs.

To put this in perspective: if you're currently running 10 concurrent long-context requests on an 80GB A100, and the KV cache accounts for nearly all of the memory left after the weights, a 3-5x KV memory reduction could let you run roughly 30-50 concurrent requests on the same hardware, or serve larger models that previously did not fit.
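The arithmetic behind that estimate is simple enough to write down. The numbers below (40 GB of weights, 4 GB of KV per request) are hypothetical inputs chosen so the baseline comes out at 10 requests, and the sketch ignores activations and fragmentation.

```python
def max_concurrent_requests(gpu_gb, weights_gb, kv_gb_per_request, kv_reduction=1):
    """Requests that fit once the model weights are resident, assuming the
    remaining memory is used only for KV cache."""
    free_gb = gpu_gb - weights_gb
    return int(free_gb * kv_reduction // kv_gb_per_request)

for reduction in (1, 3, 5):
    n = max_concurrent_requests(gpu_gb=80, weights_gb=40,
                                kv_gb_per_request=4, kv_reduction=reduction)
    print(f"{reduction}x KV reduction -> {n} requests")
# 1x -> 10 requests, 3x -> 30 requests, 5x -> 50 requests
```

The scaling only applies to the memory left over after weights, so the gain in concurrency tracks the KV reduction one-to-one only when that leftover memory is spent entirely on KV cache.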

Part 10: Connecting the Dots - How Everything Flows Together

Now let's zoom out and trace the complete lifecycle of a single request through the RouteKV system. This is where the connections between components become vivid.

The intra-system dependencies:

The Profiler's output directly shapes the Cost Model's training distribution
The Cost Model's predictions feed the Tier Plan Compiler's decisions
The Tier Plan directly governs what memory tier each KV block lands in
The Kernel Dispatcher must know the tier layout to pick the right kernel — it can't just assume all KV is on GPU
The Scoring Module feeds back into the Tier Plan on every forward pass, making placement adaptive mid-request

The inter-system dependencies:

RouteKV sits between the model weights and the serving scheduler
It's orthogonal to batching, scheduling, and sampling — those remain in your existing framework
The Shared KV Store connects multiple requests, meaning placement decisions for one request can affect another

This is why RouteKV uses the compiler metaphor: it's doing global optimization across multiple dimensions simultaneously, much like a compiler has to balance loop-invariant code motion, register allocation, and instruction scheduling against one another, because each decision affects the others.

Part 11: Tiering Configs - Tuning RouteKV for Your Workload

RouteKV ships with several YAML config profiles that let you tune the tiering aggressiveness:

Baseline (no RouteKV):

python experiments/benchmark_tiering.py \
  --config configs/llama_1b_baseline.yaml

This runs with standard LRU-style eviction for comparison purposes. Always run this first.

3-Tier Learned (recommended starting point):

python experiments/benchmark_tiering.py \
  --config configs/llama_1b_tier3.yaml

Uses HBM + DRAM + Transformed Store. The cost model drives assignments. Good balance of performance and quality.

Aggressive Compression (for extreme memory pressure):

python experiments/benchmark_tiering.py \
  --config configs/llama_1b_tier3_aggressive.yaml

Pushes more blocks into the Transformed (compressed) store. Maximum memory savings, slightly higher quality risk. Use this when you're hard-constrained on GPU memory.

The config files themselves are human-readable YAML. You can adjust tier thresholds, compression ratios, prefetch window sizes, and sharing policies to match your workload's specific characteristics.

Part 12: Broader Implications

Let's step back and think about where RouteKV fits in the broader tree of LLM systems research.

Branch 1: Memory Systems
RouteKV is fundamentally a memory hierarchy manager. The idea of hot/warm/cold tiering isn't new; it's how operating systems have managed disk and RAM for decades. What's new is applying this thinking at the granularity of individual attention head KV blocks, with learned cost models instead of fixed OS page replacement algorithms.

Branch 2: Compiler Theory
The compiler metaphor is genuinely apt. Traditional compilers solve register allocation (which values live in fast registers vs slow memory) using constraint-solving and cost models. RouteKV does the same thing but for GPU HBM vs CPU DRAM vs storage. The JIT aspect, compiling placement plans at request time based on live workload state, mirrors JIT compilers like the JVM's HotSpot.

Branch 3: LLM Inference Optimization
RouteKV connects to the broader effort to make LLM inference economically viable at scale. FlashAttention reduced memory traffic between HBM and on-chip SRAM. PagedAttention reduced fragmentation. RouteKV aims to reduce the fundamental constraint: the amount of fast memory required per active request. These three optimizations are complementary and could stack.

Branch 4: Context Window Scaling
As models push to 1M+ token contexts (see Gemini 1.5, GPT-4 with retrieval), the KV cache problem becomes the bottleneck. RouteKV's tiering approach is one of the key enabling technologies that could make million-token inference practical without a proportional increase in GPU memory.

The cross-branch connection: All four branches connect at one point: making long-context inference economically viable. That's the real impact of RouteKV if it achieves its goals.

Part 13: Common Questions

Q: Is RouteKV production-ready?
Not yet. The README is upfront about this: it's a Research Prototype. APIs are evolving, and the system hasn't been battle-tested at scale. But it's an active research project with a solid architecture.

Q: Does RouteKV require CUDA?
Yes. The kernel dispatcher uses custom CUDA and Triton kernels. CPU-only operation isn't supported, which makes sense given the project's focus on GPU inference optimization.

Q: Can I use it with models other than Llama?
Conceptually, yes: the profiler and cost model are model-agnostic. The tier plan compiler works at the KV block level, which is a standard abstraction. In practice, the APIs are currently designed around Llama-style architectures. Integration with other model families would require some adaptation.

Q: How does the cost model actually work internally?
The cost model is a small neural network trained on profiling traces. It takes features like layer index, token position, context length, and historical attention statistics as inputs and predicts tier assignments. Because it's small, it runs with negligible overhead at inference time.

Q: What happens if the cost model makes a bad prediction?
The scoring module provides an adaptive feedback loop. If a KV block that was sent to DRAM turns out to be heavily accessed, the scoring module can trigger an async prefetch back to HBM for the next layer. It's not perfect, but it's significantly better than static LRU.

Q: Does RouteKV compress the KV cache?
Yes, for blocks assigned to the "Transformed Store" (cold tier). These blocks are compressed or quantized before offloading, reducing the bandwidth required to move them back when needed. The compression method is configurable via the YAML configs.

Part 14: How to Contribute / Experiment

RouteKV is MIT licensed and open for contributions. If you want to experiment with it:

Run the notebooks — The notebooks/ directory has Colab-ready research notebooks. This is the easiest entry point.
Try different configs — Change the tiering thresholds in the YAML files and observe how they affect the benchmark results.
Swap the cost model — Replace the default cost model with your own (e.g., a tree-based model instead of a neural net) and benchmark the difference.
Profile a new model — Run the profiler on a different base model (Mistral, Phi, Qwen) and see how the access patterns differ from Llama.
Write tests — The tests/ directory has unit and integration tests. New tiering policies, scoring functions, or kernel implementations all need test coverage.

The GitHub repo is at: https://github.com/harsha-mangena/routekv-compiler

The Bottom Line

Let's bring it home.

LLM inference at long contexts is fundamentally a memory management problem in disguise. The compute is there. The algorithms are there. What's breaking down is the movement of KV tensors between memory tiers — a problem that existing systems solve with blunt instruments like LRU eviction and static heuristics.

RouteKV Compiler takes a different bet: treat KV cache placement as a compilation problem. Analyze the workload offline. Train a cost model. At runtime, compile a placement plan that jointly answers where KV lives, how it's encoded, when it moves, and which requests share it. Execute that plan with tier-aware CUDA kernels.

The four questions that RouteKV answers together:

WHERE? — HBM for hot blocks, DRAM for warm, compressed store for cold, shared store for common prefixes
HOW? — Full precision for hot, quantized/compressed for cold
WHEN? — Async prefetch driven by the tier plan, not static layer-ahead windows
WHO? — The shared KV store handles cross-request reuse at the routing level

No existing system answers all four. That's the gap. That's the value proposition.

Is it a silver bullet? No. It's a research prototype with APIs that will change and benchmarks that haven't been independently verified. But the architecture is sound, the compiler analogy is apt, and the problem it's attacking is real and growing more painful as context windows expand.

If you work on LLM infrastructure, this is a project worth watching and contributing to.

Resources:

GitHub: https://github.com/harsha-mangena/routekv-compiler
License: MIT
Status: Research Prototype (Python 3.10+, CUDA 11.8+)

This blog covers RouteKV Compiler as of its initial public release. The project is under active development and APIs may change.
