Giskard Evaluation & Testing Framework for AI Systems

Introduction
Artificial intelligence (AI) systems are becoming indispensable in industries ranging from healthcare to finance. However, the rapid adoption of AI models raises critical questions about their reliability, fairness, and security. Addressing these concerns is essential to prevent unintended consequences and build trust in AI systems. Giskard, an open-source Python library, provides a comprehensive framework for evaluating and testing AI models. This guide explores Giskard’s powerful capabilities, from setup and integration to model evaluation, showcasing a practical example of building a climate-focused question-answering (QA) system using LangChain and OpenAI models.
By the end of this blog, you will understand:
- How to set up and configure Giskard for your AI workflows.
- The process of building a domain-specific AI model integrated with LangChain.
- Techniques to detect vulnerabilities, including bias and hallucinations, in AI systems.
- The steps to automate testing and ensure compliance and quality.
Setting Up Giskard: The First Steps
Installation
Before we delve into model creation and evaluation, it is crucial to install Giskard and its dependencies. Use the following command to set up your environment:
```python
%pip install "giskard[llm]" langchain langchain-openai langchain-community pypdf faiss-cpu openai tiktoken
```
This installation includes:
- Giskard: For AI evaluation and testing (the `[llm]` extra pulls in the LLM-specific scanners).
- LangChain: For building composable language model pipelines.
- FAISS: For efficient similarity search and clustering.
- pypdf: Powers LangChain's `PyPDFLoader` for processing PDF documents.
- openai and tiktoken: For calling OpenAI models and tokenizing text.
Why These Libraries?
Giskard acts as the foundation for testing, while LangChain and FAISS simplify the creation of AI models that require advanced document retrieval and processing capabilities. These libraries are especially useful for building QA models based on extensive textual data.
Setting Up the OpenAI API Key
To access OpenAI’s language models, you need to securely set up your API key. The following snippet ensures this is done correctly:
```python
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
```
This snippet retrieves your OpenAI API key securely from Colab’s user data. Replace this with your preferred method of securely handling API keys if you’re not using Google Colab.
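If you are running outside Colab, a common alternative (a minimal sketch; adapt it to your own secret-management setup) is to read the key from the environment and prompt for it only when it is missing:

```python
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```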
Building an AI Model with LangChain
Now that we’ve set up the environment, let’s build an AI model to answer climate-related questions using data from the IPCC Climate Change Synthesis Report (2023).
Step 1: Preparing the Vector Store
The first step in creating a QA model is to preprocess the document and store it in a retrievable format. We use LangChain’s text processing and FAISS for this purpose:
```python
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Prepare the vector store (FAISS) with the IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())
```
Code Breakdown
- `PyPDFLoader`: Downloads the IPCC PDF report and extracts its text.
- `RecursiveCharacterTextSplitter`: Splits the document into smaller chunks (1,000 characters) with 100-character overlaps to preserve context across chunk boundaries (see the short splitter demo below).
- `FAISS`: Converts these chunks into a searchable vector database using OpenAI embeddings.
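To build intuition for the chunk size and overlap parameters, here is a minimal, self-contained demo of the same splitter on a short string (the tiny sizes are for illustration only):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Deliberately small chunks so the overlap between consecutive chunks is easy to see
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)
text = (
    "Climate change is a long-term shift in temperatures and weather patterns. "
    "Human activities have been the main driver since the industrial era."
)
for chunk in demo_splitter.split_text(text):
    print(repr(chunk))
```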
Expected Output
This step doesn’t produce a direct output but prepares a vector database (`db`) that can be queried for relevant text chunks.
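You can sanity-check the store by querying it directly (a minimal sketch; the query string below is just an illustration):

```python
# Retrieve the two chunks most similar to a sample query
docs = db.similarity_search("projected sea level rise", k=2)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:200])
```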
Real-World Applications
This preprocessing pipeline is essential for tasks like:
- Document search engines.
- Legal document analysis.
- Summarization of extensive reports.
Step 2: Defining the Prompt Template
A well-designed prompt ensures the language model generates accurate and contextually relevant answers. Here is the prompt template:
```python
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""
```
Why Use a Prompt Template?
Prompt templates standardize the input to the language model, ensuring consistent responses across different queries. This is particularly important for domain-specific tasks where precision and clarity are paramount.
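To see exactly what the model receives, you can render the template with placeholder values (illustrative only; in the real chain, the context comes from the retriever):

```python
from langchain.prompts import PromptTemplate

demo_prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
# Placeholder values stand in for retrieved excerpts and a user question
print(demo_prompt.format(context="(excerpt from the IPCC report)", question="What drives sea level rise?"))
```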
Step 3: Creating the QA Chain
Combine the vector store and prompt template to build a retrieval-based QA system:
```python
from langchain.chains import RetrievalQA

# gpt-4o is a chat model, so use ChatOpenAI rather than the completions-based OpenAI class
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
```
Explanation
- `RetrievalQA`: Retrieves relevant chunks from the vector database (`db`) and passes them to the language model (`llm`) for generating an answer.
- `gpt-4o`: OpenAI's chat model, used here as the underlying language model for generating responses.
Example Query
Test the QA chain with a query:
```python
response = climate_qa_chain.invoke({"query": "Is sea level rise avoidable? When will it stop?"})
# RetrievalQA returns a dict; the generated answer is under the "result" key
print(response["result"])
```
Expected Output:
- "Sea level rise is largely unavoidable due to current warming levels. It will continue for centuries, though mitigation efforts can slow the rate."
Evaluating the Model with Giskard
Wrapping the Model
To enable Giskard’s evaluation and testing functionalities, wrap the QA chain:
```python
!pip install backports.strenum griffe==0.48.0

import giskard
import pandas as pd

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    Giskard expects one generated string per input row, so we extract the
    "result" field from each RetrievalQA response.
    """
    return [climate_qa_chain.invoke({"query": question})["result"] for question in df["question"]]

# Wrap with Giskard
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
```
Key Features
- `model_predict`: Converts the QA chain into a callable function.
- Giskard wrapper: Adds metadata like model type and description for effective testing.
Testing with Example Data
Create a dataset of example queries for testing:
```python
examples = [
    "According to the IPCC report, what are key risks in Europe?",
    "Is sea level rise avoidable? When will it stop?",
]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": examples}), target=None)

predictions = giskard_model.predict(giskard_dataset)
print(predictions.prediction)
```
Expected Output
The model generates context-aware answers to each query based on the IPCC report.
Scanning for Vulnerabilities
Run Giskard’s scan to detect issues:
```python
# Focused scan: only run the hallucination detectors for quick feedback
report = giskard.scan(giskard_model, giskard_dataset, only="hallucination")

# Full scan across all detectors
full_report = giskard.scan(giskard_model, giskard_dataset)
```
Use Case
This scan identifies vulnerabilities such as:
- Hallucinations (unsupported claims).
- Bias in responses.
- Security weaknesses.
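Giskard's LLM-assisted detectors call an OpenAI model under the hood to generate and judge adversarial probes. Recent Giskard releases expose a configuration hook to pin that judge model explicitly; this is a hedged sketch, so verify the exact API against the documentation for your installed version:

```python
import giskard

# Pin the model Giskard uses internally for its LLM-assisted detectors
# (available in recent Giskard releases; check your version's docs)
giskard.llm.set_llm_model("gpt-4o")
```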
Saving the Report
```python
display(full_report)
full_report.to_html("scan_report.html")
```
Save the report for further analysis or sharing.
Generating Automated Test Suites
Giskard can generate test suites from the scan:
```python
test_suite = full_report.generate_test_suite(name="Test suite generated by scan")
test_suite.run()
```
Automated test suites enable continuous validation of model quality during development.
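Because `test_suite.run()` returns a result object, the suite can gate a CI pipeline. As a minimal sketch (assuming the result exposes a boolean `passed` attribute, as in current Giskard releases):

```python
# Fail the build when any generated test fails
results = test_suite.run()
if not results.passed:
    raise SystemExit("Model quality checks failed - see the suite output above.")
```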
Conclusion
Summary
This guide demonstrated how to:
- Set up Giskard and related libraries for AI evaluation.
- Build a robust QA system using LangChain and OpenAI models.
- Evaluate and enhance the system’s reliability using Giskard’s scanning and testing tools.
Next Steps
- Experiment with different datasets and document types.
- Customize test suites to align with specific domain requirements.
- Explore Giskard’s advanced features for large-scale model validation.
Resources
- Giskard Documentation
- LangChain Documentation
- FAISS Documentation
- OpenAI API Documentation
- Giskard Build Fast With AI NoteBook
---------------------------------
Stay Updated: Follow Build Fast with AI pages for all the latest AI updates and resources.
Experts predict 2025 will be the defining year for Gen AI implementation. Want to be ahead of the curve?
Join Build Fast with AI’s Gen AI Launch Pad 2025 - your accelerated path to mastering AI tools and building revolutionary applications.