Supercharge LLM Inference with vLLM

Introduction
Large Language Models (LLMs) are at the forefront of AI-driven applications, but running them efficiently remains a challenge due to their high computational and memory requirements. vLLM is a powerful, optimized inference engine designed to enhance the speed and efficiency of LLM execution. This blog provides a comprehensive guide to using vLLM, covering installation, model loading, text generation, batch processing, embeddings, and text classification.
By the end of this article, you will:
- Understand how to install and set up vLLM.
- Learn how to load and use LLMs efficiently with vLLM.
- Explore batch processing for handling multiple prompts simultaneously.
- Generate embeddings and perform text classification using vLLM.
Installation and Setup
Before using vLLM, install the library with the following command:
!pip install vllm
This command installs the necessary dependencies to start working with vLLM.
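If you want to confirm that the package imported correctly before proceeding, a quick version check (not part of the original walkthrough) is enough:

import vllm
print(vllm.__version__)  # prints the installed vLLM version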
Initializing and Using vLLM
Loading a Model
To begin, load an LLM using vLLM. Here’s how you can load OPT-125M from Facebook’s model collection:
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
This initializes an instance of the LLM, making it ready for inference.
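Beyond the model name, the LLM constructor accepts several tuning options. The sketch below shows a few commonly used ones; the specific values are illustrative assumptions, not requirements of OPT-125M:

# Hedged sketch: optional constructor arguments (values are illustrative)
llm = LLM(
    model="facebook/opt-125m",
    dtype="float16",              # numeric precision for weights and activations
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may reserve
    max_model_len=2048,           # longest sequence length to support
)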
Configuring Sampling Parameters
Sampling parameters control the randomness and diversity of text generation. Here’s how you can configure them:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,  # Controls randomness; higher values give more variation
    top_p=0.95,       # Nucleus sampling; restricts choices to the most probable tokens
    max_tokens=256    # Maximum number of tokens in the output
)
These settings influence the model’s output diversity and length.
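To see how these knobs trade determinism for diversity, here is a small illustrative sketch with two alternative configurations (the exact values are arbitrary):

# Greedy decoding: temperature 0 always picks the most likely next token
greedy_params = SamplingParams(temperature=0.0, max_tokens=64)

# A more exploratory configuration: higher temperature, tighter nucleus
creative_params = SamplingParams(temperature=1.2, top_p=0.9, max_tokens=128)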
Generating Text with vLLM
Now that the model is loaded and configured, let’s generate text from different prompts:
prompts = [ "Hello, my name is", "The capital of France is", "The future of AI is", ] outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Expected Output
Prompt: 'Hello, my name is', Generated text: 'Alice and I love AI research.'
Prompt: 'The capital of France is', Generated text: 'Paris, a city known for its rich history and culture.'
Prompt: 'The future of AI is', Generated text: 'full of possibilities, revolutionizing industries worldwide.'
This demonstrates how vLLM efficiently generates coherent and contextually relevant text.
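vLLM can also return several completions for the same prompt. The sketch below (an illustration, not from the original example) sets n in SamplingParams and iterates over output.outputs, which holds one entry per completion:

multi_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64, n=3)
outputs = llm.generate(["The future of AI is"], multi_params)

for output in outputs:
    for i, completion in enumerate(output.outputs):
        print(f"Completion {i}: {completion.text!r}")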
Batch Processing for Large Workloads
vLLM supports batch processing, so many prompts can be handled in a single call, which substantially improves throughput.
prompts = [ "What is the meaning of life?", "Write a short story about a cat.", "Translate 'hello' to Spanish.", "What is the capital of Japan?", "Explain the theory of relativity.", "Write a poem about the ocean.", "What is the highest mountain in the world?", "Write a Python function to calculate the factorial of a number.", ] * 10 # Expanding to 80 prompts outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Expected Performance Output
Processed prompts: 100%|██████████| 80/80 [00:02<00:00, 37.47it/s, est. speed input: 346.85 toks/s, output: 3708.88 toks/s]
This output indicates efficient batch processing, with high token throughput.
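Throughput numbers like these depend heavily on hardware, so it can be useful to time a batch yourself. Here is a minimal sketch, assuming the llm and sampling_params objects defined earlier:

import time

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count generated tokens across all prompts and report tokens per second
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens for {len(prompts)} prompts in {elapsed:.2f}s "
      f"({total_tokens / elapsed:.1f} tok/s)")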
Generating Embeddings with vLLM
Embeddings convert text into numerical vectors, useful for NLP tasks like similarity comparison and clustering.
prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] model = LLM( model="facebook/opt-125m", task="embed", enforce_eager=True, ) outputs = model.embed(prompts) for prompt, output in zip(prompts, outputs): embeds = output.outputs.embedding embeds_trimmed = ((str(embeds[:16])[:-1] + ", ...]") if len(embeds) > 16 else embeds) print(f"Prompt: {prompt!r} | Embeddings: {embeds_trimmed} (size={len(embeds)})")
Expected Output
Prompt: 'Hello, my name is' | Embeddings: [0.024, -0.017, 0.152, ..., 0.101] (size=768)
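As a quick illustration of the similarity use case mentioned above, the vectors returned by model.embed can be compared with cosine similarity. This sketch assumes NumPy is available and reuses the outputs list from the embedding example:

import numpy as np

vec_a = np.array(outputs[0].outputs.embedding)  # "Hello, my name is"
vec_b = np.array(outputs[2].outputs.embedding)  # "The capital of France is"

cosine = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"Cosine similarity: {cosine:.3f}")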
Text Classification with vLLM
Text classification categorizes input text into predefined classes.
prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] model = LLM( model="facebook/opt-125m", task="classify", enforce_eager=True, ) outputs = model.classify(prompts) for prompt, output in zip(prompts, outputs): probs = output.outputs.probs probs_trimmed = ((str(probs[:16])[:-1] + ", ...]") if len(probs) > 16 else probs) print(f"Prompt: {prompt!r} | Class Probabilities: {probs_trimmed} (size={len(probs)})")
Expected Output
Prompt: 'The capital of France is' | Class Probabilities: [0.89, 0.05, 0.02, ...] (size=5)
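To turn these probabilities into a prediction, take the index with the highest probability. The label names depend on the classification head of the model you load, so this sketch only reports the index:

for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs
    predicted = max(range(len(probs)), key=lambda i: probs[i])  # argmax over class probabilities
    print(f"Prompt: {prompt!r} | Predicted class index: {predicted} (p={probs[predicted]:.2f})")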
Conclusion
vLLM is a powerful tool for fast and efficient LLM inference. Key takeaways:
- It significantly improves inference speed and reduces memory usage.
- It supports batch processing, real-time streaming, text generation, embeddings, and classification.
- It is open-source and easy to integrate into NLP pipelines.
---------------------------
Stay updated: follow the Build Fast with AI pages for all the latest AI updates and resources.
Experts predict 2025 will be the defining year for Gen AI implementation. Want to be ahead of the curve?
Join Build Fast with AI's Gen AI Launch Pad 2025 - your accelerated path to mastering AI tools and building revolutionary applications.
---------------------------
Resources and Community
Join our community of 12,000+ AI enthusiasts and learn to build powerful AI applications! Whether you're a beginner or an experienced developer, this tutorial will help you understand and apply efficient LLM inference in your projects.
- Website: www.buildfastwithai.com
- LinkedIn: linkedin.com/company/build-fast-with-ai/
- Instagram: instagram.com/buildfastwithai/
- Twitter: x.com/satvikps
- Telegram: t.me/BuildFastWithAI