What is LLM Serving?#

Large Language Model (LLM) serving refers to the process of deploying trained language models to production environments where they can handle user requests and generate responses in real time. This is fundamentally different from training: serving focuses on making models available, scalable, and performant for end users.

The LLM Text Generation Process#

LLMs operate as next-token predictors. Here’s how they work (a short code sketch follows the list):

  1. Tokenization: Input text is converted into tokens (words, subwords, or characters)

  2. Processing: The model runs a forward pass over these tokens to build contextual representations

  3. Generation: The model generates output one token at a time

  4. Completion: Generation stops when reaching stopping criteria or maximum length
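The loop below is a minimal sketch of these four steps, assuming the Hugging Face transformers library and gpt2 purely as a stand-in model; real serving stacks add batching, sampling strategies, and caching, but the shape of the loop is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1. Tokenization: text -> token ids
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

max_new_tokens = 20
with torch.no_grad():
    for _ in range(max_new_tokens):
        # 2. Processing: run the model over the current sequence
        logits = model(input_ids).logits
        # 3. Generation: pick the next token (greedy decoding for simplicity)
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        # 4. Completion: stop at end-of-sequence or maximum length
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))
```

Note that this naive loop re-runs the model over the entire sequence at every step; the prefill/decode split described next, together with the KV cache, is what makes generation efficient in practice.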

Two Phases of LLM Inference#

LLM inference operates through two distinct phases, each with very different performance characteristics:

Prefill Phase#

  • The model encodes all input tokens simultaneously

  • High efficiency through parallelized computations

  • Maximizes GPU utilization

  • Precomputes and caches key-value (KV) vectors as intermediate token representations (see the code sketch after this list)
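Under the same illustrative assumptions (Hugging Face transformers, gpt2 as a stand-in model), a prefill pass might look like the sketch below: the whole prompt goes through the model in one parallel call, and the per-layer key/value projections come back as a reusable cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("Explain LLM serving in one sentence.", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over all prompt tokens at once.
    # use_cache=True asks the model to return its key/value projections
    # so the decode phase can reuse them instead of recomputing them.
    prefill_out = model(prompt_ids, use_cache=True)

kv_cache = prefill_out.past_key_values             # the KV cache
first_token_logits = prefill_out.logits[:, -1, :]  # logits for the first generated token
```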

Decode Phase#

  • The model generates tokens sequentially using the key-value cache (KV cache)

  • Each token depends on all previous tokens

  • Limited by memory bandwidth rather than compute capacity

  • Underutilizes GPU resources compared to the prefill phase (see the code sketch after this list)
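Continuing the same sketch, the decode loop below feeds the model only the single newest token at each step and passes the KV cache back in, so earlier tokens are never recomputed; the per-step arithmetic is small, which is why this phase is bound by memory bandwidth rather than raw compute.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("Explain LLM serving in one sentence.", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill (as in the previous sketch): build the KV cache in one pass.
    out = model(prompt_ids, use_cache=True)
    kv_cache = out.past_key_values
    next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(20):
        # Decode: pass only the newest token; attention over all earlier
        # tokens is served from the KV cache rather than recomputed.
        out = model(next_id, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_id = torch.argmax(out.logits[:, -1, :], dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```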

In short: prefill processes all prompt tokens in parallel, while decode generates output tokens one at a time, sequentially.