Key Concepts and Optimizations#

These key concepts will help you design an LLM serving pipeline that meets your service level objectives (SLOs).

1. Key-Value (KV) Caching#

KV caching eliminates redundant computation during text generation (see the sketch after this comparison):

Without KV Cache:

  • Recalculate keys and values for entire sequence each time

  • Extremely inefficient for long sequences

With KV Cache:

  • Cache computed K and V values for all previous tokens

  • Only compute K and V for the new token

  • Reuse cached values for context
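
The following is a minimal, framework-agnostic sketch of the idea: at each decode step, only the new token's K and V are computed and appended to a cache, and attention runs over the cached values. The toy attention layer, weights, and shapes are illustrative, not any serving engine's actual API.

```python
# Toy single-head attention decode loop with a KV cache (illustrative only).
import torch

d_model = 64
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

# The cache stores K and V for every token processed so far.
k_cache, v_cache = [], []

def decode_step(x):
    """x: hidden state of the new token only, shape (1, d_model)."""
    q = x @ w_q
    # Compute K and V for the new token only, then append them to the cache.
    k_cache.append(x @ w_k)
    v_cache.append(x @ w_v)
    k = torch.cat(k_cache)          # (t, d_model): cached keys, no recomputation
    v = torch.cat(v_cache)          # (t, d_model): cached values
    scores = (q @ k.T) / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v

for _ in range(8):                  # generate 8 tokens
    new_token_hidden = torch.randn(1, d_model)
    context_vector = decode_step(new_token_hidden)
```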

2. Continuous Batching#

Continuous batching optimizes throughput by eliminating GPU idle time (see the scheduler sketch at the end of this subsection):

Vanilla Static Batching:

  • Wait for all requests in batch to complete

  • Creates idle time when requests finish at different rates

  • Underutilizes GPU resources

Figure: Completing four sequences using static batching. On the first iteration (left), each sequence generates one token (blue) from the prompt tokens (yellow). After several iterations (right), the completed sequences have different lengths because each emits its end-of-sequence token (red) at a different iteration. Even though sequence 3 finished after two iterations, static batching means that the GPU is underutilized until the last sequence in the batch finishes generation (in this example, sequence 2 after six iterations).

Continuous Batching:

  • Immediately replace completed requests with new ones

  • Maintains constant GPU utilization

  • Increases concurrent user capacity

Figure: Completing seven sequences using continuous batching. The left shows the batch after a single iteration; the right shows the batch after several iterations. Once a sequence emits an end-of-sequence token, a new sequence takes its place (for example, sequences S5, S6, and S7). This achieves higher GPU utilization because the GPU doesn't wait for all sequences to complete before starting a new one.
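
Below is a minimal sketch of the scheduling idea, not any engine's real implementation: whenever a sequence finishes, a waiting request immediately takes its batch slot. The request names, batch size, and the stand-in `generation_step` function are all made up for illustration.

```python
# Toy continuous-batching scheduler loop (illustrative only).
import random
from collections import deque

MAX_BATCH_SIZE = 4
waiting = deque(f"request-{i}" for i in range(7))   # 7 queued requests
running = {}                                        # request -> iterations so far

def generation_step(batch):
    """Stand-in for one model forward pass; returns requests that just finished."""
    finished = []
    for request in batch:
        running[request] += 1
        if random.random() < 0.3:                   # requests finish at different times
            finished.append(request)
    return finished

while waiting or running:
    # Fill free batch slots immediately instead of waiting for the whole
    # batch to drain -- the key difference from static batching.
    while waiting and len(running) < MAX_BATCH_SIZE:
        running[waiting.popleft()] = 0
    for request in generation_step(list(running)):
        print(f"{request} finished after {running.pop(request)} iterations")
```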

3. Model Parallelization or Alternatives#

Large LLMs (>70B parameters) can provide more accurate answers, but they might not fit entirely on one GPU or even one node. You can parallelize the model across multiple GPUs or nodes to effectively increase the available memory, at the cost of some latency from communication overhead.

You can also use alternatives such as quantization, distillation, or multi-LoRA adapters to reduce the model's memory footprint so that it fits on fewer GPUs.
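
As one hedged illustration, if you serve with an engine such as vLLM, tensor parallelism and quantized weights can be configured along these lines. The model names are placeholders, and the exact parameters depend on your engine and version:

```python
# Hedged sketch: sharding or shrinking a large model (assumes vLLM's offline LLM API;
# adapt names and parameters to whatever engine you actually use).
from vllm import LLM

# Tensor parallelism: shard a 70B-class model across 4 GPUs on one node.
sharded_llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model ID
    tensor_parallel_size=4,
)

# Alternative: load quantized (AWQ) weights so the model fits on fewer GPUs.
quantized_llm = LLM(
    model="a-70b-awq-checkpoint",               # placeholder quantized checkpoint
    quantization="awq",
)
```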

4. Context Window Considerations#

The context window defines the maximum tokens a model can process:

| Context Length | Use Cases | Memory Impact |
| --- | --- | --- |
| 4K-8K tokens | Q&A, simple chat | Low KV cache requirements |
| 32K-128K tokens | Document analysis, summarization | Moderate memory usage |
| 128K+ tokens | Multi-step agents, complex reasoning | High memory requirements |

A large context window can improve answer quality on long inputs, but it also increases memory pressure, which reduces how many requests you can process concurrently.
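
A rough back-of-the-envelope estimate makes the trade-off concrete. The sketch below assumes a common KV cache sizing formula (2 for K and V, times layers, KV heads, head dimension, context length, and bytes per value) and shapes that roughly match a 70B-class model with grouped-query attention in FP16; substitute your model's actual configuration.

```python
# Back-of-the-envelope KV cache size per request (assumed formula and example shapes).
def kv_cache_gib(context_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """2 (K and V) * layers * KV heads * head dim * tokens * bytes, in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes / 1024**3

for context_len in (8_192, 32_768, 131_072):
    print(f"{context_len:>7} tokens -> ~{kv_cache_gib(context_len):.1f} GiB per request")
```

With numbers like these, a single 128K-token request can consume tens of gigabytes of KV cache on its own, which directly limits how many requests fit on a GPU at once.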