Evaluating LLM Responses with Judges Library

Introduction
With the rapid adoption of Large Language Models (LLMs) in various applications, ensuring the quality of their responses is essential. Judges Library is a lightweight Python package designed to evaluate AI-generated responses for correctness, clarity, and bias. Whether you're fine-tuning an LLM or assessing its reliability, this library provides LLM-as-a-Judge tools to automate and enhance response evaluation.
In this tutorial, you'll learn how to:
- Install and set up Judges Library
- Generate LLM responses for testing
- Evaluate responses using classifier judges, jury systems, and AutoJudge
- Leverage multi-model support for more diverse evaluation results
Installing Judges Library and Dependencies
Before using Judges Library, install the necessary dependencies:
```bash
pip install judges "judges[auto]" instructor
```
To use LLM-based evaluators, you need to configure API keys. For instance, in Google Colab:
```python
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
```
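Outside Colab, a minimal alternative is to keep the key in a local `.env` file and load it at startup. This sketch assumes the `python-dotenv` package is installed and that `.env` contains an `OPENAI_API_KEY=` entry:

```python
# Alternative for local scripts: load the API key from a .env file.
# Assumes python-dotenv is installed and .env contains OPENAI_API_KEY=...
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
```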
Generating an LLM Response for Evaluation
To test Judges Library, we first generate an AI response using OpenAI's GPT models. Below is a sample story-based question:
```python
from openai import OpenAI

client = OpenAI()

story = """
Fig was a small, scruffy dog with a big personality. One day, he met a rabbit in the woods and they became friends.
"""

question = "What is the name of the rabbit in the story?"
expected = "I don't know"

input_text = f'{story}\n\nQuestion: {question}'

output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': input_text}],
).choices[0].message.content
```
The model should respond with "I don't know" since the rabbit's name is not mentioned. Now, let's evaluate this response.
Evaluating AI Output with Judges Library
Using a Classifier Judge
Classifier judges provide boolean evaluations (True/False, Good/Bad) based on predefined correctness criteria. Below, we use PollMultihopCorrectness to evaluate whether the generated output matches expectations.
```python
from judges.classifiers.correctness import PollMultihopCorrectness

correctness = PollMultihopCorrectness(model='gpt-4o-mini')

judgment = correctness.judge(input=input_text, output=output, expected=expected)
print(judgment.reasoning)
print(judgment.score)
```
Expected Output:
```
The provided answer matches the reference answer exactly, indicating a correct response.
True
```
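Because `judge()` returns an object exposing `reasoning` and `score` (as printed above), the same classifier can be reused across a small batch of cases. The extra test case below is a made-up illustration; only `correctness` and `judge()` come from the example above:

```python
# Sketch: reuse one classifier judge across several (input, output, expected) cases.
# The second case is an invented example with a hallucinated rabbit name.
cases = [
    (input_text, output, expected),
    (input_text, "The rabbit is called Clover.", expected),
]

for inp, out, exp in cases:
    j = correctness.judge(input=inp, output=out, expected=exp)
    print(j.score, "-", j.reasoning)
```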
Using a Jury System for Diversified Evaluation
A jury system combines multiple evaluators for a more balanced judgment. Here, we use PollMultihopCorrectness and RAFTCorrectness to assess an LLM response.
```python
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='gpt-4o')
raft = RAFTCorrectness(model='gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(input=input_text, output=output, expected=expected)
print(verdict.score)
```
Expected Output:
```
0.5  # average score across the two judges
```
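Since `vote()` returns a numeric score under average voting, a common follow-up is to turn it into a pass/fail decision. This is a minimal sketch; the 0.5 cutoff is an arbitrary illustration, not a library default:

```python
# Sketch: convert the jury's averaged score into a pass/fail decision.
PASS_THRESHOLD = 0.5  # arbitrary example threshold

passed = verdict.score >= PASS_THRESHOLD
print(f"score={verdict.score:.2f}, passed={passed}")
```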
Creating Custom AI Evaluators with AutoJudge
AutoJudge allows you to create custom AI evaluators based on labeled datasets. Below, we initialize AutoJudge with a dataset containing labeled AI responses.
```python
from judges.classifiers.auto import AutoJudge

dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are common in the highlands.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a hotel in Tokyo?",
        "output": "Hotel Sunroute Plaza Shinjuku is highly rated.",
        "label": 1,
        "feedback": "Provides a specific and useful recommendation.",
    },
]

task = "Evaluate responses for accuracy, clarity, and helpfulness."

autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
)
```
Now, let’s evaluate a new AI-generated response:
```python
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."

judgment = autojudge.judge(input=input_, output=output)
print(judgment.reasoning)
print(judgment.score)
```
Expected Output:
```
The response meets accuracy, clarity, and helpfulness criteria. The provided attractions are factual and relevant.
True
```
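The same `autojudge.judge(input=..., output=...)` call can be applied across a list of responses to get a quick pass-rate summary. The examples below are invented for illustration:

```python
# Sketch: score several hypothetical (input, output) pairs with the AutoJudge built above.
examples = [
    ("What are the top attractions in New York City?",
     "Some top attractions in NYC include the Statue of Liberty and Central Park."),
    ("Can I ride a dragon in Scotland?",
     "Yes, dragon rides depart hourly from Edinburgh Castle."),  # clearly wrong
]

results = [autojudge.judge(input=inp, output=out) for inp, out in examples]
for (inp, _), judgment in zip(examples, results):
    print(inp, "->", judgment.score)

# Fraction of responses the judge accepted (scores print as True/False above).
print("pass rate:", sum(bool(r.score) for r in results) / len(results))
```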
Conclusion
Judges Library is a powerful framework for evaluating LLM-generated responses with various judging mechanisms, including classifier judges, jury systems, and AutoJudge. It provides a structured way to assess AI outputs based on correctness, clarity, and helpfulness, ensuring higher reliability and reduced bias in AI applications.
Key Takeaways:
- Classifier judges offer Boolean evaluations.
- Jury systems allow multiple models to contribute to a single evaluation.
- AutoJudge enables custom AI-powered assessments based on labeled datasets.
- The library supports OpenAI and LiteLLM models for flexibility (see the sketch below).
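The last point is worth a hypothetical illustration. Assuming judges forwards model identifiers to LiteLLM (as its multi-model support suggests) and that the matching provider API key is set in the environment, switching providers should only require swapping the model string; the `gemini/gemini-1.5-flash` identifier here is an assumed example, not taken from the library's documentation:

```python
# Hypothetical sketch: pointing a judge at a non-OpenAI model via a LiteLLM-style
# model identifier. Assumes the provider's API key (e.g. GEMINI_API_KEY) is set.
from judges.classifiers.correctness import PollMultihopCorrectness

gemini_judge = PollMultihopCorrectness(model="gemini/gemini-1.5-flash")
judgment = gemini_judge.judge(input=input_text, output=output, expected=expected)
print(judgment.score)
```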