# How to Choose an LLM?

With so many models available, choosing the right one for your use case is crucial. Here’s a practical framework for model selection based on the Anyscale documentation.

## Model Selection Framework

### 1. Model Quality Benchmarks

Use established benchmarks to evaluate model capabilities:

- **Chatbot Arena**: conversational quality, ranked by crowd-sourced, head-to-head user preference
- **MMLU-Pro**: domain-specific knowledge across academic and professional subjects
- **Code benchmarks** (e.g., HumanEval, MBPP): programming and code generation tasks
- **Reasoning tests** (e.g., GSM8K, MATH): logical reasoning and problem-solving
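
Public leaderboards are a starting point, but scores on generic benchmarks don't always transfer to your task, so it pays to run a few spot-checks of your own. Below is a minimal eval-harness sketch against an OpenAI-compatible endpoint; the `base_url`, model id, and test cases are placeholders for your own served model and data.

```python
# Minimal custom-eval sketch against an OpenAI-compatible endpoint.
# base_url, model id, and test cases are placeholders -- swap in your own.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

cases = [  # a handful of task-specific probes; real evals need many more
    {"prompt": "What is 17 * 24?", "expect": "408"},
    {"prompt": "What is the capital of Australia?", "expect": "Canberra"},
]

passed = 0
for case in cases:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0.0,  # reduce sampling noise for grading
    )
    answer = resp.choices[0].message.content or ""
    passed += case["expect"].lower() in answer.lower()

print(f"{passed}/{len(cases)} checks passed")
```

Even a few dozen task-specific cases like these often surface differences that leaderboard scores hide.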

### 2. Task and Domain Alignment

Match your model to your specific use case:

| Model Type | Best For | Example Use Cases |
| --- | --- | --- |
| Base Models | Next-token prediction, open-ended continuation | Sentence completion, code autocomplete |
| Instruction-tuned | Following explicit directions | Chatbots, coding assistants, Q&A |
| Reasoning-optimized | Complex problem-solving | Mathematical reasoning, scientific analysis |
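
The distinction in the table shows up concretely in how you prompt each model type: a base model simply continues raw text, while an instruction-tuned model expects its chat template. A quick sketch using Hugging Face `transformers` (the model id is illustrative):

```python
# The prompting difference between base and instruction-tuned models
# (the model id below is illustrative).
from transformers import AutoTokenizer

# Base model: hand it raw text and it continues the sequence.
base_prompt = "def fibonacci(n):"

# Instruction-tuned model: wrap messages in the model's chat template,
# which inserts the special tokens the model was fine-tuned on.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a Fibonacci function in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)  # plain text, now wrapped in the template's role markers
```

Sending raw text to an instruction-tuned model, or chat-formatted text to a base model, is a common source of degraded output.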

### 3. Context Window Requirements

Match context length to your use case:

| Context Length | Use Cases | Memory Impact |
| --- | --- | --- |
| 4K-8K tokens | Q&A, simple chat | Low |
| 32K-128K tokens | Document analysis, summarization | Moderate |
| 128K+ tokens | Multi-step agents, complex reasoning | High |
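
The memory impact comes largely from the KV cache, which grows linearly with context length: every token keeps its key and value tensors for each layer and KV head. A back-of-the-envelope sketch, using Llama 3.1 8B's configuration as the default (32 layers, 8 KV heads, head dimension 128, FP16):

```python
# KV-cache sizing: memory grows linearly with context length.
# Default config values are Llama 3.1 8B's; adjust for your model.
def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 2x for K and V
    return tokens * per_token / 2**30

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.1f} GiB KV cache")
# 8K -> 1.0 GiB, 32K -> 4.0 GiB, 128K -> 16.0 GiB, on top of the weights
```

At 128K tokens the cache alone rivals the 8B model's roughly 16 GB of FP16 weights, which is why long-context serving lands in the high-memory row.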

### 4. Hardware and Cost Considerations

Balance performance with resource constraints:

- **Small Models (7B-13B)**: 1-2 GPUs, fast deployment, lower cost
- **Medium Models (70B-80B)**: 4-8 GPUs, balanced performance/cost
- **Large Models (400B+)**: multiple nodes, maximum capability, higher cost
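
To sanity-check these GPU counts, weight memory alone gives a floor: parameters times bytes per parameter, plus some overhead. A rough sketch (the 80 GiB GPU size and 1.2x overhead factor are assumptions):

```python
# Weight-memory floor for GPU counts (KV cache, activations, and
# framework overhead come on top; treat the result as a minimum).
import math

def min_gpus(params_b, gpu_gib=80, dtype_bytes=2, overhead=1.2):
    weight_gib = params_b * 1e9 * dtype_bytes / 2**30
    return math.ceil(weight_gib * overhead / gpu_gib)

for size_b in (8, 70, 405):
    print(f"{size_b:>4}B params -> >= {min_gpus(size_b)} x 80 GiB GPU(s)")
# 8B -> 1, 70B -> 2, 405B -> 12 GPUs on weights alone
```

Real deployments add headroom for KV cache and batching, which is why a 70B model is typically served on 4-8 GPUs rather than the 2 that weights alone would suggest.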

## Practical Selection Process

1. **Define Requirements**: latency, accuracy, context length, budget
2. **Benchmark Models**: test on your specific tasks and data
3. **Consider Trade-offs**: speed vs. accuracy, cost vs. capability
4. **Start Simple**: begin with smaller models, scale up as needed
5. **Iterate and Optimize**: monitor performance and adjust accordingly
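
For step 2, two numbers worth measuring early are time-to-first-token (TTFT) and total latency, since they dominate perceived responsiveness. A minimal sketch against an OpenAI-compatible streaming endpoint (URL and model id are placeholders):

```python
# Measure time-to-first-token and total latency over a streaming
# OpenAI-compatible endpoint; URL and model id are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
ttft = None
pieces = []
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize the benefits of caching."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        pieces.append(delta)
total = time.perf_counter() - start

print(f"TTFT {ttft:.2f}s | total {total:.2f}s | {len(''.join(pieces))} chars")
```

Run this over a representative prompt set and at several concurrency levels before comparing models; single-request numbers can be misleading under load.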

## Model Recommendations by Use Case

**For Production Chatbots:**

- Llama 3.1 8B/70B (balanced performance and cost)
- Mistral 7B (fast inference)

**For Code Generation:**

- Code Llama 7B/13B (specialized for code)
- DeepSeek-Coder (reasoning + code)

**For Complex Reasoning:**

- Qwen 3 32B (hybrid thinking mode)
- DeepSeek-R1 (dedicated reasoning model)

**For Document Processing:**

- Llama 3.1 70B (128K context window)
- Claude 3.5 Sonnet (excellent long-context handling)