Ragas: Evaluation Framework for RAG Systems

Retrieval-Augmented Generation (RAG) systems have emerged as a transformative technology in artificial intelligence, combining the strengths of retrieval and generative models. However, evaluating their performance effectively has remained a challenge. This blog post delves into Ragas, an open-source evaluation framework designed to address this gap, offering developers tools to analyze and optimize their RAG workflows.
Introduction
Ragas provides a structured approach to assessing the quality of RAG systems by focusing on two key components: retrieval and generation. With support for metrics such as context recall, faithfulness, and factual correctness, Ragas helps developers fine-tune their systems to deliver accurate and reliable results. This blog will guide you through the following:
- Understanding the setup and installation of Ragas.
- Building a simple question-answering (QA) application using LangChain and OpenAI models.
- Creating evaluation datasets and utilizing Ragas metrics to analyze performance.
- Insights into advanced evaluation techniques and real-world applications.
By the end of this post, you will be equipped to leverage Ragas for enhancing your RAG systems.
Getting Started with Ragas
Setup and Installation
To begin, install Ragas and its dependencies using the following commands:
pip install ragas sacrebleu langchain-openai

# Optionally, install the latest development version of Ragas from source:
pip install git+https://github.com/explodinggradients/ragas.git
Next, set your OpenAI API key so the client libraries can authenticate:
from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
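The snippet above assumes Google Colab. If you are running locally, a minimal alternative (a sketch using only the standard library) is to prompt for the key with getpass:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")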
Key Components
Ragas integrates seamlessly with popular RAG frameworks like LangChain, enabling easy evaluation and optimization. For this walkthrough, we'll build a simple QA application using LangChain's tools and OpenAI models.
Building a Simple QA Application
Data Preparation
The foundation of any QA system is high-quality data. Here's an example dataset containing brief biographies of prominent AI leaders:
from langchain_core.documents import Document

content_list = [
    "Andrew Ng is the CEO of Landing AI and is known for his pioneering work in deep learning. He is also widely recognized for democratizing AI education through platforms like Coursera.",
    "Sam Altman is the CEO of OpenAI and has played a key role in advancing AI research and development. He is a strong advocate for creating safe and beneficial AI technologies.",
    "Demis Hassabis is the CEO of DeepMind and is celebrated for his innovative approach to artificial intelligence. He gained prominence for developing systems like AlphaGo that can master complex games.",
    "Sundar Pichai is the CEO of Google and Alphabet Inc., and he is praised for leading innovation across Google's vast product ecosystem. His leadership has significantly enhanced user experiences on a global scale.",
    "Arvind Krishna is the CEO of IBM and is recognized for transforming the company towards cloud computing and AI solutions. He focuses on providing cutting-edge technologies to address modern business challenges.",
]

langchain_documents = [Document(page_content=content) for content in content_list]
Setting Up the Vector Store
To enable efficient document retrieval, we'll use OpenAI embeddings and an in-memory vector store:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = InMemoryVectorStore(embeddings)
_ = vector_store.add_documents(langchain_documents)
Configuring the Retriever
The retriever fetches the most relevant documents for a query. Here, k=1 returns only the single best match, which suits our dataset of one short biography per person:
retriever = vector_store.as_retriever(search_kwargs={"k": 1})
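Before wiring up the full pipeline, a quick sanity check confirms that retrieval behaves as expected. This step is optional, and the query string is just an illustrative example:

# Optional: verify the retriever returns the right document
docs = retriever.invoke("Who leads DeepMind?")
print(docs[0].page_content)  # Expect the Demis Hassabis biography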
Implementing the QA Pipeline
Language Model and Prompt Configuration
Leverage OpenAI's GPT models to generate answers:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")

template = """Answer the question based only on the following context:
{context}

Question: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
qa_chain = prompt | llm | StrOutputParser()
Query Processing
Here's how you can process queries and retrieve answers:
def format_docs(relevant_docs):
    return "\n".join(doc.page_content for doc in relevant_docs)

query = "Who is the CEO of OpenAI?"
relevant_docs = retriever.invoke(query)
qa_chain.invoke({"context": format_docs(relevant_docs), "query": query})
Expected Output:
'The CEO of OpenAI is Sam Altman.'
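For repeated use, it is convenient to wrap retrieval and generation in a single helper. This is a small convenience sketch built from the components above; the name answer_query is our own, not part of LangChain:

def answer_query(query: str) -> str:
    # Retrieve the most relevant document(s), then generate an answer
    relevant_docs = retriever.invoke(query)
    return qa_chain.invoke({"context": format_docs(relevant_docs), "query": query})

print(answer_query("Who is the CEO of IBM?"))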
Evaluating the QA System with Ragas
Creating an Evaluation Dataset
Ragas enables systematic evaluation of RAG systems using metrics such as context recall, faithfulness, and factual correctness. Each evaluation sample pairs a query with the retrieved contexts, the generated response, and a reference answer. Here’s how to create an evaluation dataset:
from ragas import EvaluationDataset

dataset = []

sample_queries = [
    "Which CEO is widely recognized for democratizing AI education through platforms like Coursera?",
    "Who is Sam Altman?",
    "Who is Demis Hassabis and how did he gain prominence?",
    "Who is the CEO of Google and Alphabet Inc., praised for leading innovation across Google's product ecosystem?",
    "How did Arvind Krishna transform IBM?",
]

expected_responses = [
    "Andrew Ng is the CEO of Landing AI and is widely recognized for democratizing AI education through platforms like Coursera.",
    # ... the remaining reference answers, one per query above
]

for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = retriever.invoke(query)
    response = qa_chain.invoke({"context": format_docs(relevant_docs), "query": query})
    dataset.append({
        "user_input": query,
        "retrieved_contexts": [rdoc.page_content for rdoc in relevant_docs],
        "response": response,
        "reference": reference,
    })

evaluation_dataset = EvaluationDataset.from_list(dataset)
Applying Ragas Metrics
Evaluate the dataset using key metrics:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

evaluator_llm = LangchainLLMWrapper(llm)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)
Expected Output:
{'context_recall': 1.0000, 'faithfulness': 0.9500, 'factual_correctness': 0.9140}
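Aggregate scores are a useful summary, but per-sample scores show where the system actually falls short. Assuming pandas is installed, the result object returned by evaluate() can be converted to a DataFrame for row-by-row inspection:

# Inspect per-sample metric scores (requires pandas)
df = result.to_pandas()
print(df.head())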
Conclusion
Ragas is a powerful tool for evaluating RAG systems, providing insights into retrieval accuracy and generation quality. By following this guide, developers can create robust QA systems and continuously improve their performance. To explore more, visit the Ragas GitHub repository.
Next Steps
- Experiment with additional metrics, such as response relevancy and semantic similarity, to assess system performance (see the sketch after this list).
- Integrate Ragas into larger-scale projects.
- Explore advanced RAG workflows with LangChain and other frameworks.
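As a starting point for the first item above, here is a sketch of how additional metrics might be wired in. It reuses the llm and embeddings objects defined earlier and assumes your installed Ragas version exposes ResponseRelevancy and SemanticSimilarity; check the Ragas documentation for the metrics available in your version:

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import ResponseRelevancy, SemanticSimilarity

# Wrap the same LLM and embedding model used to build the QA system
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[ResponseRelevancy(), SemanticSimilarity()],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(result)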
Resources
- Ragas GitHub repository: https://github.com/explodinggradients/ragas