Local Deployment & Inference#
Now let’s deploy our medium-sized LLM (Llama 3.1 70B) locally and query it.
Prerequisites#
Hardware Requirements:
Access to 4-8 GPUs (L40S, A100-40G, or similar) with enough combined GPU memory for the 70B model (~140 GB)
Software Requirements:
Ray Serve LLM
For gated models, a Hugging Face token with authorization to access the model
Installation:
pip install "ray[serve,llm]"
Hugging Face Token:
export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
Launching Ray Serve#
Let’s start our LLM service:
!serve run serve_llama_3_1_70b:app --non-blocking
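The serve run command imports the app object from serve_llama_3_1_70b.py, which isn’t shown in this section. Below is a minimal sketch of what that module might contain, assuming Ray Serve LLM’s LLMConfig and build_openai_app helpers; the model_source, accelerator_type, tensor_parallel_size, and max_model_len values are assumptions to adjust for your model access and hardware.

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",                       # name clients use in requests
        model_source="meta-llama/Llama-3.1-70B-Instruct",  # assumed gated HF repo; requires HF_TOKEN access
    ),
    accelerator_type="A100",             # optional; match your GPUs (e.g., "L40S")
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=8,          # shard the 70B model across your 4-8 GPUs
        max_model_len=8192,              # assumed context length; tune for your memory budget
    ),
)

# `serve run serve_llama_3_1_70b:app` picks up this application object.
app = build_openai_app({"llm_configs": [llm_config]})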
Sending Requests#
Once deployed, your endpoint is available at http://localhost:8000. The OpenAI client requires an API key, but the local deployment doesn’t validate it, so a placeholder such as "FAKE_KEY" works.
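Before sending chat requests, you can optionally confirm the service is up by listing the served models. This quick check assumes the standard OpenAI-compatible /v1/models route and uses the requests library:

import requests

# List the models served at the OpenAI-compatible endpoint (the key is a placeholder).
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer FAKE_KEY"},
)
print(resp.json())  # the listing should include "my-llama-3.1-70b"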
Let’s test our model with some example requests:
from urllib.parse import urljoin

from openai import OpenAI

# The local deployment doesn't validate API keys, so a placeholder works.
API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

# Point the OpenAI client at the local Ray Serve OpenAI-compatible endpoint.
client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Send a chat request to the deployed model and stream the reply.
response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)

# Print tokens as they arrive.
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
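For comparison, here is the same request without streaming; the full reply comes back in a single response object, reusing the client defined above:

# Non-streaming request: wait for the complete reply, then print it.
response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=False,
)
print(response.choices[0].message.content)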
Shutting Down#
When you’re done testing, shut down the service:
!serve shutdown -y