Advanced Topics: Monitoring & Optimization#

Now let’s explore advanced features for production deployments.

Enabling LLM Monitoring#

The Serve LLM Dashboard offers deep visibility into model performance. Let’s enable comprehensive monitoring:

# serve_llama_3_1_70b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        )
    ),
    runtime_env=dict(
        env_vars={
            "HF_TOKEN": os.environ.get("HF_TOKEN"),
        }
    ),
    engine_kwargs=dict(
        max_model_len=32768,
        tensor_parallel_size=8,
    ),
    # Enable detailed engine metrics
    log_engine_metrics=True
)

app = build_openai_app({"llm_configs": [llm_config]})

Anyscale provides an easy way to visualize your LLM metrics on an integrated Grafana dashboard.

On your Anyscale Workspace or Service page, go to Metrics, then click on the View on Grafana dropdown and select Ray Serve LLM Dashboard.

!serve run serve_llama_3_1_70b:app --non-blocking
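
Once the service is running, sending a few requests will populate the dashboard with data. Below is a minimal sketch using the OpenAI Python client; it assumes the service is reachable at http://localhost:8000 from your workspace and that no real API key is enforced (the placeholder value is arbitrary).

# query_service.py
from openai import OpenAI

# Assumption: default local Serve address; adjust base_url for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

response = client.chat.completions.create(
    model="my-llama-3.1-70b",  # matches model_id in the LLMConfig above
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)

After a few requests, metrics should start appearing on the Ray Serve LLM Dashboard.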

Remember to shut down your service when you're done:

!serve shutdown -y

Improving Concurrency#

Ray Serve LLM uses vLLM as its backend engine, which logs the maximum concurrency it can support.
Example log for 8xL40S:

INFO 08-19 20:57:37 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 17.79x

Let’s explore optimization strategies:

Concurrency Optimization Strategies#

Below are key strategies to improve model concurrency and performance when serving LLMs.


1. Reduce max_model_len#

  • 32,768 tokens → concurrency ≈ 18

  • 16,384 tokens → concurrency ≈ 36

  • Trade-off: shorter context window but higher concurrency (see the sketch below)

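A minimal sketch of this change, reusing the LLMConfig from serve_llama_3_1_70b.py above; only engine_kwargs needs to change, and the exact concurrency gain depends on your model and hardware.

# In the LLMConfig above, shrink the context window:
engine_kwargs = dict(
    max_model_len=16384,     # was 32768; frees KV-cache memory per request
    tensor_parallel_size=8,
)
# Pass this as engine_kwargs=engine_kwargs when constructing LLMConfig.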

2. Use Quantized Models#

  • FP16 → FP8: ~50% memory reduction

  • FP16 → INT4: ~75% memory reduction

  • Frees up memory for the KV cache, enabling more concurrent requests (see the sketch below)

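A sketch of this strategy: point model_source at a pre-quantized checkpoint instead of the FP16 weights. The repository name below is a placeholder, not a real reference; substitute a quantized build of the model that you have access to and that vLLM supports.

# In the LLMConfig above, load quantized weights instead of FP16:
model_loading_config = dict(
    model_id="my-llama-3.1-70b-fp8",
    # Placeholder name for an FP8 build; replace with an actual quantized checkpoint.
    model_source="your-org/Llama-3.1-70B-Instruct-FP8",
)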

3. Enable Pipeline Parallelism#

  • Distribute model layers across multiple nodes by setting pipeline_parallel_size > 1

  • This increases total GPU memory, and with it KV-cache capacity, at the cost of added latency from multi-node communication (see the sketch below)

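The sketch below assumes a second 8xL40S node is available and keeps tensor parallelism within each node; vLLM requires total GPUs = tensor_parallel_size × pipeline_parallel_size.

# In the LLMConfig above, span two nodes (16 GPUs total):
engine_kwargs = dict(
    max_model_len=32768,
    tensor_parallel_size=8,    # GPUs per node, as before
    pipeline_parallel_size=2,  # two pipeline stages, one per node
)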

4. Scale with More Replicas#

  • Horizontally scale across multiple nodes

  • Each replica runs an independent model instance

  • Total concurrency = per-replica concurrency × number of replicas (see the sketch below)

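A sketch of this strategy, extending the autoscaling_config from the deployment above; with tensor_parallel_size=8, each replica needs its own set of 8 GPUs.

# In the LLMConfig above, allow Serve to scale out further under load:
deployment_config = dict(
    autoscaling_config=dict(
        min_replicas=1,
        max_replicas=8,   # was 4; each replica is an independent engine
    )
)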

5. Upgrade Hardware#

  • Example: L40S (48 GB) → A100 (80 GB)

  • More GPU memory allows higher concurrency

  • Faster interconnects (e.g., NVLink) reduce latency
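
As a sketch of this last strategy, the same deployment on A100 GPUs only requires changing accelerator_type. The exact string is platform-dependent; "A100-80G" is an assumption here, so check which accelerator types your cluster exposes.

# serve_llama_3_1_70b_a100.py (sketch)
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="A100-80G",   # 80 GB per GPU vs 48 GB on an L40S
    engine_kwargs=dict(
        max_model_len=32768,
        tensor_parallel_size=8,    # more headroom per GPU leaves a larger KV cache
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})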