
Supercharge LLM Inference with vLLM

February 14, 2025
5 min read


Introduction

Large Language Models (LLMs) power many of today's AI-driven applications, but running them efficiently remains a challenge because of their heavy compute and memory demands. vLLM is an open-source, high-throughput inference engine, built around the PagedAttention memory manager, designed to speed up LLM execution. This blog provides a practical guide to using vLLM, covering installation, model loading, text generation, batch processing, embeddings, and text classification.

By the end of this article, you will:

  • Understand how to install and set up vLLM.
  • Learn how to load and use LLMs efficiently with vLLM.
  • Explore batch processing for handling multiple prompts simultaneously.
  • Generate embeddings and perform text classification using vLLM.

Installation and Setup

Before using vLLM, install the library with the following command:

!pip install vllm

This installs vLLM and its dependencies. Note that the default build targets NVIDIA GPUs with CUDA; consult the vLLM documentation for CPU or other-accelerator builds.
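To confirm the installation picked up a working build, a quick sanity check helps (a minimal sketch; the printed version depends on when you install):

import torch
import vllm

print(vllm.__version__)           # Installed vLLM version
print(torch.cuda.is_available())  # The default vLLM build expects a CUDA GPU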

Initializing and Using vLLM

Loading a Model

To begin, load an LLM through vLLM. Here's how to load Facebook's OPT-125M, a small model that is downloaded automatically from the Hugging Face Hub:

from vllm import LLM

llm = LLM(model="facebook/opt-125m")

This initializes an instance of the LLM, making it ready for inference.
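The constructor also accepts tuning options worth knowing about. Here is a sketch with illustrative values (adjust them for your hardware):

llm = LLM(
    model="facebook/opt-125m",
    dtype="float16",              # Precision for weights and activations
    gpu_memory_utilization=0.90,  # Fraction of GPU memory vLLM may reserve for weights and KV cache
    max_model_len=2048,           # Upper bound on prompt plus generation length
)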

Configuring Sampling Parameters

Sampling parameters control the randomness and diversity of text generation. Here’s how you can configure them:

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,  # Controls randomness; higher values produce more varied output
    top_p=0.95,       # Nucleus sampling; draws only from the smallest token set with cumulative probability >= 0.95
    max_tokens=256    # Maximum number of tokens to generate
)

These settings influence the model’s output diversity and length.
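The same class also covers deterministic and reproducible decoding. A minimal sketch (seed is a standard SamplingParams field):

# Greedy decoding: temperature=0 always picks the most likely next token
greedy_params = SamplingParams(temperature=0, max_tokens=256)

# Reproducible sampling: fixing the seed makes repeated runs match
seeded_params = SamplingParams(temperature=0.8, top_p=0.95, seed=42, max_tokens=256)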

Generating Text with vLLM

Now that the model is loaded and configured, let’s generate text from different prompts:

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Expected Output

Prompt: 'Hello, my name is', Generated text: 'Alice and I love AI research.'
Prompt: 'The capital of France is', Generated text: 'Paris, a city known for its rich history and culture.'
Prompt: 'The future of AI is', Generated text: 'full of possibilities, revolutionizing industries worldwide.'

This demonstrates how vLLM efficiently generates coherent and contextually relevant text.
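vLLM can also return several completions per prompt in a single call via the n parameter of SamplingParams. A minimal sketch:

# Request three independent samples for one prompt
multi_params = SamplingParams(n=3, temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The future of AI is"], multi_params)

for output in outputs:
    for i, completion in enumerate(output.outputs):  # One entry per requested sample
        print(f"Sample {i}: {completion.text!r}")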

Batch Processing for Large Workloads

vLLM excels at batch workloads: its continuous batching scheduler processes many prompts in parallel, keeping the GPU saturated and throughput high.

prompts = [
    "What is the meaning of life?",
    "Write a short story about a cat.",
    "Translate 'hello' to Spanish.",
    "What is the capital of Japan?",
    "Explain the theory of relativity.",
    "Write a poem about the ocean.",
    "What is the highest mountain in the world?",
    "Write a Python function to calculate the factorial of a number.",
] * 10  # Expanding to 80 prompts

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Expected Performance Output

Processed prompts: 100%|██████████| 80/80 [00:02<00:00, 37.47it/s, est. speed input: 346.85 toks/s, output: 3708.88 toks/s]

This output indicates efficient batch processing, with high token throughput.
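To verify throughput on your own hardware, a simple wall-clock measurement is enough. A minimal sketch:

import time

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Rough throughput: generated tokens per second across the whole batch
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts)} prompts in {elapsed:.2f}s ({total_tokens / elapsed:.0f} output tok/s)")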

Generating Embeddings with vLLM

Embeddings convert text into numerical vectors, useful for NLP tasks like similarity comparison and clustering.

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

model = LLM(
    model="facebook/opt-125m",
    task="embed",        # Pool hidden states into one embedding vector per prompt
    enforce_eager=True,  # Skip CUDA graph capture for faster startup
)

outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    embeds_trimmed = ((str(embeds[:16])[:-1] + ", ...]") if len(embeds) > 16 else embeds)
    print(f"Prompt: {prompt!r} | Embeddings: {embeds_trimmed} (size={len(embeds)})")

Expected Output

Prompt: 'Hello, my name is' | Embeddings: [0.024, -0.017, 0.152, ..., 0.101] (size=768)
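A common next step is comparing prompts by the cosine similarity of their embeddings. A minimal sketch using NumPy, building on the outputs list above:

import numpy as np

# One embedding vector per prompt
vectors = [np.array(o.outputs.embedding) for o in outputs]

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[3]))  # 'Hello, my name is' vs 'The future of AI is'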

Text Classification with vLLM

Text classification assigns input text to one of a set of predefined classes. Note that this task needs a checkpoint with a sequence-classification head; a plain generative model such as facebook/opt-125m has none, so the example below uses jason9693/Qwen2.5-1.5B-apeach, the classifier featured in vLLM's official examples.

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

model = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",  # Checkpoint with a sequence-classification head
    task="classify",                        # Return class probabilities instead of text
    enforce_eager=True,                     # Skip CUDA graph capture for faster startup
)

outputs = model.classify(prompts)

for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs
    probs_trimmed = ((str(probs[:16])[:-1] + ", ...]") if len(probs) > 16 else probs)
    print(f"Prompt: {prompt!r} | Class Probabilities: {probs_trimmed} (size={len(probs)})")

Expected Output

Prompt: 'The capital of France is' | Class Probabilities: [0.97, 0.03] (size=2)
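To turn the probabilities into a predicted label, take the argmax. A minimal sketch (mapping indices to label names depends on the checkpoint's config):

# Pick the most probable class index for each prompt
for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    print(f"{prompt!r} -> class {predicted} (p={probs[predicted]:.2f})")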

Conclusion

vLLM is a powerful tool for fast, efficient LLM inference. Key takeaways:

  • PagedAttention and continuous batching give it substantially higher throughput and lower memory overhead than naive inference.
  • It supports batch processing, streaming, text generation, embeddings, and classification through a single interface.
  • It is open source and straightforward to integrate into NLP pipelines.

Resources

  • vLLM GitHub Repository
  • Hugging Face Model Hub
  • vLLM Notebook
