Evaluating LLM Responses with Judges Library

March 13, 2025

Introduction

As Large Language Models (LLMs) move into more applications, checking the quality of their responses becomes essential. Judges Library is a lightweight Python package for evaluating AI-generated responses on criteria such as correctness, clarity, and bias. Whether you're fine-tuning an LLM or assessing its reliability, the library provides LLM-as-a-Judge tools that automate response evaluation.

In this tutorial, you'll learn how to:

  • Install and set up Judges Library
  • Generate LLM responses for testing
  • Evaluate responses using classifier judges, jury systems, and AutoJudge
  • Leverage multi-model support for more diverse evaluation results

Installing Judges Library and Dependencies

Install Judges Library before using it. The "judges[auto]" extra adds the dependencies that AutoJudge needs:

pip install judges "judges[auto]" instructor

To use LLM-based evaluators, you need to configure API keys. For instance, in Google Colab:

from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
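
Outside Colab, google.colab.userdata is not available. Here is a minimal sketch of an alternative, assuming the key has already been exported in your shell as OPENAI_API_KEY. The helper name require_api_key is hypothetical and is not part of Judges Library:

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the evaluators.")
    return key
```

Calling require_api_key() at the top of a script makes a missing key fail early, before any judge tries to reach the API.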

Generating an LLM Response for Evaluation

To test Judges Library, we first generate an AI response using OpenAI's GPT models. Below is a sample story-based question:

from openai import OpenAI

client = OpenAI()

story = """
Fig was a small, scruffy dog with a big personality. One day, he met a rabbit in the woods and they became friends.
"""

question = "What is the name of the rabbit in the story?"
expected = "I don't know"

input_text = f'{story}\n\nQuestion: {question}'

output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': input_text}]
).choices[0].message.content

The story never names the rabbit, so a faithful answer says the name is unknown rather than inventing one. That is why the expected answer is "I don't know". Now, let's evaluate this response.

Evaluating AI Output with Judges Library

Using a Classifier Judge

Classifier Judges return boolean verdicts (True/False, Good/Bad) based on predefined criteria. Below, we use PollMultihopCorrectness to check whether the generated output agrees with the expected answer.

from judges.classifiers.correctness import PollMultihopCorrectness

correctness = PollMultihopCorrectness(model='gpt-4o-mini')

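# Ask the judge to compare the model's output against the expected answer
judgment = correctness.judge(input=input_text, output=output, expected=expected)

print(judgment.reasoning)  # the judge's explanation
print(judgment.score)      # the boolean verdict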

Sample output (the reasoning text varies from run to run):

The provided answer matches the reference answer exactly, indicating a correct response.
True
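
To make the shape of a judgment concrete, here is a toy stand-in that needs no API call. It is a sketch only: ToyJudgment and exact_match_judge are hypothetical names, and the string comparison replaces the LLM prompt that a real classifier judge such as PollMultihopCorrectness sends to the model.

```python
from dataclasses import dataclass


@dataclass
class ToyJudgment:
    score: bool      # True when the output is judged correct
    reasoning: str   # short explanation for the score


def _normalize(text: str) -> str:
    """Lowercase the text and drop outer whitespace and trailing periods."""
    return text.strip().rstrip(".").lower()


def exact_match_judge(output: str, expected: str) -> ToyJudgment:
    """Judge correctness by comparing the normalized output and reference."""
    matched = _normalize(output) == _normalize(expected)
    if matched:
        reasoning = "Output matches the reference answer."
    else:
        reasoning = "Output differs from the reference answer."
    return ToyJudgment(score=matched, reasoning=reasoning)


judgment = exact_match_judge("I don't know.", "I don't know")
print(judgment.reasoning)
print(judgment.score)
```

Because it compares strings, this toy judge would reject a correct paraphrase such as "The story does not name the rabbit". Handling that case is the reason to use an LLM-based judge.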

Using a Jury System for Diversified Evaluation

A jury system combines multiple evaluators for a more balanced judgment. Here, we use PollMultihopCorrectness and RAFTCorrectness to assess an LLM response.

from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

poll = PollMultihopCorrectness(model='gpt-4o')
raft = RAFTCorrectness(model='gpt-4o-mini')

jury = Jury(judges=[poll, raft], voting_method="average")

verdict = jury.vote(input=input_text, output=output, expected=expected)
print(verdict.score)

Sample output:

0.5  # Average of the two judges' votes; 0.5 means they disagreed
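
The 0.5 above is what the average of two disagreeing boolean votes looks like. Here is a minimal sketch of that arithmetic, assuming each judge returns a True/False score. average_vote is a hypothetical helper written for illustration, not the library's Jury implementation:

```python
def average_vote(scores):
    """Return the mean of boolean votes, counting True as 1 and False as 0."""
    if not scores:
        raise ValueError("at least one vote is required")
    return sum(1 if s else 0 for s in scores) / len(scores)


print(average_vote([True, False]))  # 0.5: the two judges disagree
print(average_vote([True, True]))   # 1.0: both judges accept the answer
```

A score strictly between 0 and 1 is therefore a signal that the judges split, which is often a good reason to read each judge's reasoning.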

Creating Custom AI Evaluators with AutoJudge

AutoJudge allows you to create custom AI evaluators based on labeled datasets. Below, we initialize AutoJudge with a dataset containing labeled AI responses.

from judges.classifiers.auto import AutoJudge

dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are common in the highlands.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional."
    },
    {
        "input": "Can you recommend a hotel in Tokyo?",
        "output": "Hotel Sunroute Plaza Shinjuku is highly rated.",
        "label": 1,
        "feedback": "Provides a specific and useful recommendation."
    }
]

task = "Evaluate responses for accuracy, clarity, and helpfulness."

autojudge = AutoJudge.from_dataset(dataset=dataset, task=task, model="gpt-4-turbo-2024-04-09")
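
Malformed rows are easier to catch before they reach the model. The following sketch checks that every row has the four fields used in the dataset above and that label is 0 or 1. validate_rows is a hypothetical helper, and the assumption that labels are binary comes from this example rather than from the library's documentation:

```python
REQUIRED_FIELDS = ("input", "output", "label", "feedback")


def validate_rows(rows):
    """Return a list of problems found; an empty list means the rows look usable."""
    problems = []
    for i, row in enumerate(rows):
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
        elif row["label"] not in (0, 1):
            problems.append(f"row {i}: label must be 0 or 1, got {row['label']!r}")
    return problems


rows = [
    {"input": "q", "output": "a", "label": 1, "feedback": "ok"},
    {"input": "q", "output": "a", "label": "yes"},  # no feedback field
]
print(validate_rows(rows))
```

Running a check like this on the dataset before calling AutoJudge.from_dataset keeps data problems separate from model or API errors.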

Now, let’s evaluate a new AI-generated response:

input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."

judgment = autojudge.judge(input=input_, output=output)

print(judgment.reasoning)
print(judgment.score)

Sample output (the reasoning text varies from run to run):

The response meets accuracy, clarity, and helpfulness criteria. The provided attractions are factual and relevant.
True

Conclusion

Judges Library offers several mechanisms for evaluating LLM-generated responses: classifier judges, jury systems, and AutoJudge. Together they give you a structured way to assess AI outputs on criteria such as correctness, clarity, and helpfulness.

Key Takeaways:

  • Classifier Judges return Boolean (True/False) verdicts.
  • A Jury combines the verdicts of several judges, which can run on different models.
  • AutoJudge builds a custom evaluator from a labeled dataset.
  • The library supports OpenAI and LiteLLM models for flexibility.
