Evaluating LLM Responses with Judges Library

Introduction
With the rapid adoption of Large Language Models (LLMs) in various applications, ensuring the quality of their responses is essential. Judges Library is a lightweight Python package designed to evaluate AI-generated responses for correctness, clarity, and bias. Whether you're fine-tuning an LLM or assessing its reliability, this library provides LLM-as-a-Judge tools to automate and enhance response evaluation.
In this tutorial, you'll learn how to:
- Install and set up Judges Library
- Generate LLM responses for testing
- Evaluate responses using classifier judges, jury systems, and AutoJudge
- Leverage multi-model support for more diverse evaluation results
Installing Judges Library and Dependencies
Before using Judges Library, install the necessary dependencies:
```bash
pip install judges "judges[auto]" instructor
```
To use LLM-based evaluators, you need to configure API keys. For instance, in Google Colab:
```python
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
```
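Outside Colab, a minimal alternative is to keep the key in a local `.env` file and load it at startup. This sketch assumes the `python-dotenv` package is installed and that `.env` contains an `OPENAI_API_KEY=` entry:

```python
# Alternative for local scripts: load the API key from a .env file.
# Assumes python-dotenv is installed and .env contains OPENAI_API_KEY=...
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```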
Generating an LLM Response for Evaluation
To test Judges Library, we first generate an AI response using OpenAI's GPT models. Below is a sample story-based question:
```python
from openai import OpenAI

client = OpenAI()

story = """
Fig was a small, scruffy dog with a big personality. One day, he met a rabbit in the woods and they became friends.
"""

question = "What is the name of the rabbit in the story?"
expected = "I don't know"

input_text = f'{story}\n\nQuestion: {question}'

output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': input_text}],
).choices[0].message.content
```
The model should respond with "I don't know" since the rabbit's name is not mentioned. Now, let's evaluate this response.
Evaluating AI Output with Judges Library
Using a Classifier Judge
Classifier judges provide boolean evaluations (True/False, Good/Bad) based on predefined correctness criteria. Below, we use PollMultihopCorrectness to evaluate whether the generated output matches expectations.
```python
from judges.classifiers.correctness import PollMultihopCorrectness

correctness = PollMultihopCorrectness(model='gpt-4o-mini')

judgment = correctness.judge(input=input_text, output=output, expected=expected)
print(judgment.reasoning)
print(judgment.score)
```
Expected Output:
```
The provided answer matches the reference answer exactly, indicating a correct response.
True
```
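Because `judge()` returns an object exposing `reasoning` and `score` (as printed above), the same classifier can be reused across a small batch of cases. The extra test case below is a made-up illustration; only `correctness` and `judge()` come from the example above:

```python
# Sketch: reuse one classifier judge across several (input, output, expected) cases.
# The second case is an invented example with a hallucinated rabbit name.
cases = [
    (input_text, output, expected),
    (input_text, "The rabbit is called Clover.", expected),
]

for inp, out, exp in cases:
    j = correctness.judge(input=inp, output=out, expected=exp)
    print(j.score, "-", j.reasoning)
```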
Using a Jury System for Diversified Evaluation
A jury system combines multiple evaluators for a more balanced judgment. Here, we use PollMultihopCorrectness and RAFTCorrectness to assess an LLM response.
```python
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='gpt-4o')
raft = RAFTCorrectness(model='gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(input=input_text, output=output, expected=expected)
print(verdict.score)
```
Expected Output:
```
0.5  # average score across the two judges
```
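Since `vote()` returns a numeric score under average voting, a common follow-up is to turn it into a pass/fail decision. This is a minimal sketch; the 0.5 cutoff is an arbitrary illustration, not a library default:

```python
# Sketch: convert the jury's averaged score into a pass/fail decision.
PASS_THRESHOLD = 0.5  # arbitrary example threshold

passed = verdict.score >= PASS_THRESHOLD
print(f"score={verdict.score:.2f}, passed={passed}")
```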
Creating Custom AI Evaluators with AutoJudge
AutoJudge allows you to create custom AI evaluators based on labeled datasets. Below, we initialize AutoJudge with a dataset containing labeled AI responses.
```python
from judges.classifiers.auto import AutoJudge

dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are common in the highlands.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a hotel in Tokyo?",
        "output": "Hotel Sunroute Plaza Shinjuku is highly rated.",
        "label": 1,
        "feedback": "Provides a specific and useful recommendation.",
    },
]

task = "Evaluate responses for accuracy, clarity, and helpfulness."

autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
)
```
Now, let’s evaluate a new AI-generated response:
```python
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."

judgment = autojudge.judge(input=input_, output=output)
print(judgment.reasoning)
print(judgment.score)
```
Expected Output:
```
The response meets accuracy, clarity, and helpfulness criteria. The provided attractions are factual and relevant.
True
```
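The same `autojudge.judge(input=..., output=...)` call can be applied across a list of responses to get a quick pass-rate summary. The examples below are invented for illustration:

```python
# Sketch: score several hypothetical (input, output) pairs with the AutoJudge built above.
examples = [
    ("What are the top attractions in New York City?",
     "Some top attractions in NYC include the Statue of Liberty and Central Park."),
    ("Can I ride a dragon in Scotland?",
     "Yes, dragon rides depart hourly from Edinburgh Castle."),  # clearly wrong
]

results = [autojudge.judge(input=inp, output=out) for inp, out in examples]
for (inp, _), judgment in zip(examples, results):
    print(inp, "->", judgment.score)

# Fraction of responses the judge accepted (scores print as True/False above).
print("pass rate:", sum(bool(r.score) for r in results) / len(results))
```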
Conclusion
Judges Library is a powerful framework for evaluating LLM-generated responses with various judging mechanisms, including classifier judges, jury systems, and AutoJudge. It provides a structured way to assess AI outputs based on correctness, clarity, and helpfulness, ensuring higher reliability and reduced bias in AI applications.
Key Takeaways:
- Classifier judges offer Boolean evaluations.
- Jury systems allow multiple models to contribute to a single evaluation.
- AutoJudge enables custom AI-powered assessments based on labeled datasets.
- The library supports OpenAI and LiteLLM models for flexibility (see the sketch below).
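The last point is worth a hypothetical illustration. Assuming judges forwards model identifiers to LiteLLM (as its multi-model support suggests) and that the matching provider API key is set in the environment, switching providers should only require swapping the model string; the `gemini/gemini-1.5-flash` identifier here is an assumed example, not taken from the library's documentation:

```python
# Hypothetical sketch: pointing a judge at a non-OpenAI model via a LiteLLM-style
# model identifier. Assumes the provider's API key (e.g. GEMINI_API_KEY) is set.
from judges.classifiers.correctness import PollMultihopCorrectness

gemini_judge = PollMultihopCorrectness(model="gemini/gemini-1.5-flash")
judgment = gemini_judge.judge(input=input_text, output=output, expected=expected)
print(judgment.score)
```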