Example: Deploying LoRA Adapters#

LoRA (Low-Rank Adaptation) adapters are small, efficient fine-tuned models that can be loaded on top of a base model. This allows you to serve multiple specialized behaviors from a single deployment.

Why Use LoRA Adapters?#

  • Parameter Efficiency: LoRA adapters are typically less than 1% of the base model’s size

  • Runtime Adaptation: Switch between different adapters without reloading the base model

  • Simpler MLOps: Centralize inference around one model while supporting multiple use cases

  • Cost Effective: Share an expensive base model across multiple specialized tasks

Example: Code Assistant LoRA#

Let’s deploy a base model with multiple LoRA adapters. This lets you switch between the base model’s general behavior and specialized, adapter-driven generation on a per-request basis.

For this example, we’ll use publicly available adapters from Hugging Face.

First, we need to prepare our LoRA adapters and save them in our cloud storage.

For example, the following script downloads the adapters from Hugging Face and uploads them to an S3 bucket:

import os
import boto3
from huggingface_hub import snapshot_download

# Mapping of custom names to Hugging Face LoRA adapter repo IDs
adapters = {
    "nemoguard": "nvidia/llama-3.1-nemoguard-8b-topic-control",
    "cv_job_matching": "LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching",
    "yara": "vtriple/Llama-3.1-8B-yara"
}

# S3 target
bucket_name = "llm-docs-aydin"
base_s3_path = "1-5-multi-lora/lora_checkpoints"

# Initialize S3 client
s3 = boto3.client("s3")

for custom_name, repo_id in adapters.items():
    print(f"\n📥 Downloading adapter '{custom_name}' from {repo_id}...")
    local_path = snapshot_download(repo_id)

    print(f"⬆️ Uploading files to s3://{bucket_name}/{base_s3_path}/{custom_name}/")

    for root, _, files in os.walk(local_path):
        for file_name in files:
            local_file_path = os.path.join(root, file_name)
            rel_path = os.path.relpath(local_file_path, local_path)
            s3_key = f"{base_s3_path}/{custom_name}/{rel_path}".replace("\\", "/")

            print(f"  → {s3_key}")
            s3.upload_file(local_file_path, bucket_name, s3_key)

print("\n✅ All adapters uploaded successfully.")

# List the uploaded objects under the prefix to confirm
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=base_s3_path)

print(f"Files in s3://{bucket_name}/{base_s3_path}/:")
for obj in response["Contents"]:
    print(obj["Key"])

You should end up with the following folder structure, with one directory per adapter (shown here with a generic bucket name and prefix):

s3://your-bucket/lora-adapters/
├── nemoguard/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
├── cv_job_matching/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── yara/
    ├── adapter_config.json
    └── adapter_model.safetensors
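
Before wiring this path into the deployment, you can optionally verify that each adapter prefix contains the expected files. Here is a minimal sketch using boto3, assuming the bucket and prefix names from the upload script above:

import boto3

bucket_name = "llm-docs-aydin"
base_s3_path = "1-5-multi-lora/lora_checkpoints"

s3 = boto3.client("s3")
for adapter in ["nemoguard", "cv_job_matching", "yara"]:
    prefix = f"{base_s3_path}/{adapter}/"
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    # Each adapter directory should contain at least adapter_config.json
    # and adapter_model.safetensors.
    file_names = [obj["Key"].rsplit("/", 1)[-1] for obj in resp.get("Contents", [])]
    print(f"{adapter}: {file_names}")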

Configure Ray Serve LLM with LoRA#

Now let’s configure our LLM with LoRA support. The key additions are the lora_config section and the LoRA-related settings in the engine arguments:

import os
from ray.serve.llm import LLMConfig, build_openai_app

# Configure LLM with LoRA support
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama",
        # Make sure your huggingface token has access/authorization
        # Go to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct and request access otherwise
        # Or switch to the unsloth/ version for an ungated Llama
        model_source="meta-llama/Llama-3.1-8B-Instruct" # Base model
    ),
    accelerator_type="L4",
    # LoRA configuration
    lora_config=dict(
        dynamic_lora_loading_path="s3://llm-docs-aydin/1-5-multi-lora/lora_checkpoints/",  # Your S3/GCS path
        max_num_adapters_per_replica=3  # (optional) Limit adapters per replica
    ),
    runtime_env=dict(
        env_vars={
            "HF_TOKEN": os.environ.get("HF_TOKEN"), # Set your token beforehand: export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
            "AWS_REGION": "us-west-2"  # Your AWS region
        }
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        # Enable LoRA support
        enable_lora=True,
        max_lora_rank=32,  # Maximum LoRA rank. Set to the largest rank you plan to use.
        max_loras=3,  # Must match max_num_adapters_per_replica
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

Save the configuration above as serve_my_lora_app.py, then deploy the application:

!serve run serve_my_lora_app:app --non-blocking
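
Alternatively, you can deploy from Python instead of the CLI (for example, inside a notebook). Here is a minimal sketch using serve.run; the exact behavior of the blocking flag can vary across Ray versions:

# Alternative to the `serve run` CLI
from ray import serve

from serve_my_lora_app import app  # the LLMConfig/app module shown above

# Start the app without blocking the current process. The OpenAI-compatible
# API is served at http://localhost:8000/v1 by default.
serve.run(app, blocking=False)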

Using LoRA Adapters#

Once deployed, you can query a specific adapter by setting the model name to <base_model_id>:<adapter_name>, where the adapter name matches the adapter’s folder name under dynamic_lora_loading_path:

# client.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

############################ Base model request (no adapter) #####################
print("=== Base model ===")
response = client.chat.completions.create(
    model="my-llama",  # no adapter
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")


############################ nemoguard adapter (moderation) #####################
print("=== LoRA: nemoguard ===")
# As per Nemoguard's usage instructions, add this to your system prompt
# https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-topic-control#system-instruction
TOPIC_SAFETY_OUTPUT_RESTRICTION = 'If any of the above conditions are violated, please respond with "off-topic". Otherwise, respond with "on-topic". You must respond with "on-topic" or "off-topic".'
messages_nemoguard = [
    {
        "role": "system",
        "content": f'In the next conversation always use a polite tone and do not engage in any talk about travelling and touristic destinations. {TOPIC_SAFETY_OUTPUT_RESTRICTION}',
    },
    {"role": "user", "content": "Do you know which is the most popular beach in Barcelona?"},
]
response = client.chat.completions.create(
    model="my-llama:nemoguard", ### with nemoguard adapter
    messages=messages_nemoguard,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

############################ cv_job_matching adapter (structured JSON output) ############################
print("=== LoRA: cv_job_matching ===")
messages_cv = [
    {
        "role": "system",
        "content": """You are an advanced AI model designed to analyze the compatibility between a CV and a job description. You will receive a CV and a job description. Your task is to output a structured JSON format that includes the following:

1. matching_analysis: Analyze the CV against the job description to identify key strengths and gaps.
2. description: Summarize the relevance of the CV to the job description in a few concise sentences.
3. score: Provide a numerical compatibility score (0-100) based on qualifications, skills, and experience.
4. recommendation: Suggest actions for the candidate to improve their match or readiness for the role.

Your output must be in JSON format as follows:
{
  "matching_analysis": "Your detailed analysis here.",
  "description": "A brief summary here.",
  "score": 85,
  "recommendation": "Your suggestions here."
}
""",
    },
    {
        "role": "user",
        "content": "<CV> Software engineer with 5 years of experience in Python and cloud infrastructure. </CV>\n<job_description> Looking for a backend engineer with Python and AWS experience. </job_description>",
    },
]
response = client.chat.completions.create(
    model="my-llama:cv_job_matching", ### with cv_job_matching adapter
    messages=messages_cv,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

############################ yara adapter (cybersecurity task) ############################
print("=== LoRA: yara ===")
messages_yara = [{"role": "user", "content": "Generate a YARA rule to detect a PowerShell-based keylogger. Generate ONLY the YARA rule, do not add explanations."}]
response = client.chat.completions.create(
    model="my-llama:yara", ### with yara adapter
    messages=messages_yara,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
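
You can also check which models the OpenAI-compatible endpoint exposes by calling its /v1/models route with the same client. This is a minimal sketch; whether dynamically loaded adapters appear in the listing can depend on your Ray Serve LLM version:

# Quick check of the /v1/models endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

for model in client.models.list():
    print(model.id)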

Shut down the deployment:

!serve shutdown -y

Key Benefits#

  • Single Deployment: One base model serves multiple specialized behaviors

  • Dynamic Switching: Change adapters at runtime without restarting

  • Memory Efficient: Adapters are much smaller than full fine-tuned models

  • Cost Effective: Share the expensive base model across multiple use cases

Learn More#

For comprehensive multi-LoRA deployment guides, see: