Local Deployment & Inference#
Now let’s deploy our medium-sized LLM (Llama 3.1 70B) locally and query it.
Prerequisites#
Hardware Requirements:
Access to 4-8 GPUs (L40S, A100-40G, or similar) with enough combined GPU memory for the 70B model (~140 GB)
Software Requirements:
Ray Serve LLM
For gated models, a Hugging Face token with authorization to access the model
Installation:
pip install "ray[serve,llm]"
Hugging Face Token:
export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
Launching Ray Serve#
Let’s start our LLM service:
!serve run serve_llama_3_1_70b:app --non-blocking
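The serve run command imports the app object from serve_llama_3_1_70b.py, which isn’t shown in this section. Below is a minimal sketch of what that module might contain, assuming Ray Serve LLM’s LLMConfig and build_openai_app helpers; the model_source, accelerator_type, tensor_parallel_size, and max_model_len values are assumptions to adjust for your model access and hardware.

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",                       # name clients use in requests
        model_source="meta-llama/Llama-3.1-70B-Instruct",  # assumed gated HF repo; requires HF_TOKEN access
    ),
    accelerator_type="A100",             # optional; match your GPUs (e.g., "L40S")
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=8,          # shard the 70B model across your 4-8 GPUs
        max_model_len=8192,              # assumed context length; tune for your memory budget
    ),
)

# `serve run serve_llama_3_1_70b:app` picks up this application object.
app = build_openai_app({"llm_configs": [llm_config]})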
Sending Requests#
Once deployed, your endpoint is available at http://localhost:8000. The OpenAI client requires an API key, but the local deployment doesn’t validate it, so a placeholder such as "FAKE_KEY" works.
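Before sending chat requests, you can optionally confirm the service is up by listing the served models. This quick check assumes the standard OpenAI-compatible /v1/models route and uses the requests library:

import requests

# List the models served at the OpenAI-compatible endpoint (the key is a placeholder).
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer FAKE_KEY"},
)
print(resp.json())  # the listing should include "my-llama-3.1-70b"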
Let’s test our model with some example requests:
from urllib.parse import urljoin

from openai import OpenAI

# The local deployment doesn't validate API keys, so a placeholder works.
API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

# Point the OpenAI client at the local Ray Serve OpenAI-compatible endpoint.
client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Send a chat request to the deployed model and stream the reply.
response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)

# Print tokens as they arrive.
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
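For comparison, here is the same request without streaming; the full reply comes back in a single response object, reusing the client defined above:

# Non-streaming request: wait for the complete reply, then print it.
response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=False,
)
print(response.choices[0].message.content)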
Shutting Down#
When you’re done testing, shut down the service:
!serve shutdown -y