Giskard Evaluation & Testing Framework for AI Systems

Introduction
Artificial intelligence (AI) systems are becoming indispensable in industries ranging from healthcare to finance. However, the rapid adoption of AI models raises critical questions about their reliability, fairness, and security. Addressing these concerns is essential to prevent unintended consequences and build trust in AI systems. Giskard, an open-source Python library, provides a comprehensive framework for evaluating and testing AI models. This guide explores Giskard’s powerful capabilities, from setup and integration to model evaluation, showcasing a practical example of building a climate-focused question-answering (QA) system using LangChain and OpenAI models.
By the end of this blog, you will understand:
- How to set up and configure Giskard for your AI workflows.
- The process of building a domain-specific AI model integrated with LangChain.
- Techniques to detect vulnerabilities, including bias and hallucinations, in AI systems.
- The steps to automate testing and ensure compliance and quality.
Setting Up Giskard: The First Steps
Installation
Before we delve into model creation and evaluation, it is crucial to install Giskard and its dependencies. Use the following command to set up your environment:
```python
%pip install "giskard[llm]" langchain langchain-openai langchain-community pypdf faiss-cpu openai tiktoken
```
This installation includes:
- Giskard: For AI evaluation and testing (the `[llm]` extra pulls in the LLM-specific scanners).
- LangChain: For building composable language model pipelines.
- FAISS: For efficient similarity search and clustering.
- pypdf: Powers LangChain's `PyPDFLoader` for processing PDF documents.
- openai and tiktoken: For calling OpenAI models and tokenizing text.
Why These Libraries?
Giskard acts as the foundation for testing, while LangChain and FAISS simplify the creation of AI models that require advanced document retrieval and processing capabilities. These libraries are especially useful for building QA models based on extensive textual data.
Setting Up the OpenAI API Key
To access OpenAI’s language models, you need to securely set up your API key. The following snippet ensures this is done correctly:
```python
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
```
This snippet retrieves your OpenAI API key securely from Colab’s user data. Replace this with your preferred method of securely handling API keys if you’re not using Google Colab.
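If you are running outside Colab, a common alternative (a minimal sketch; adapt it to your own secret-management setup) is to read the key from the environment and prompt for it only when it is missing:

```python
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```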
Building an AI Model with LangChain
Now that we’ve set up the environment, let’s build an AI model to answer climate-related questions using data from the IPCC Climate Change Synthesis Report (2023).
Step 1: Preparing the Vector Store
The first step in creating a QA model is to preprocess the document and store it in a retrievable format. We use LangChain’s text processing and FAISS for this purpose:
```python
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Prepare the vector store (FAISS) with the IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())
```
Code Breakdown
- `PyPDFLoader`: Downloads the IPCC PDF report and extracts its text.
- `RecursiveCharacterTextSplitter`: Splits the document into smaller chunks (1,000 characters) with 100-character overlaps to preserve context across chunk boundaries (see the short splitter demo below).
- `FAISS`: Converts these chunks into a searchable vector database using OpenAI embeddings.
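To build intuition for the chunk size and overlap parameters, here is a minimal, self-contained demo of the same splitter on a short string (the tiny sizes are for illustration only):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Deliberately small chunks so the overlap between consecutive chunks is easy to see
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)
text = (
    "Climate change is a long-term shift in temperatures and weather patterns. "
    "Human activities have been the main driver since the industrial era."
)
for chunk in demo_splitter.split_text(text):
    print(repr(chunk))
```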
Expected Output
This step doesn’t produce a direct output but prepares a vector database (`db`) that can be queried for relevant text chunks.
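You can sanity-check the store by querying it directly (a minimal sketch; the query string below is just an illustration):

```python
# Retrieve the two chunks most similar to a sample query
docs = db.similarity_search("projected sea level rise", k=2)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:200])
```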
Real-World Applications
This preprocessing pipeline is essential for tasks like:
- Document search engines.
- Legal document analysis.
- Summarization of extensive reports.
Step 2: Defining the Prompt Template
A well-designed prompt ensures the language model generates accurate and contextually relevant answers. Here is the prompt template:
```python
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""
```
Why Use a Prompt Template?
Prompt templates standardize the input to the language model, ensuring consistent responses across different queries. This is particularly important for domain-specific tasks where precision and clarity are paramount.
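To see exactly what the model receives, you can render the template with placeholder values (illustrative only; in the real chain, the context comes from the retriever):

```python
from langchain.prompts import PromptTemplate

demo_prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
# Placeholder values stand in for retrieved excerpts and a user question
print(demo_prompt.format(context="(excerpt from the IPCC report)", question="What drives sea level rise?"))
```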
Step 3: Creating the QA Chain
Combine the vector store and prompt template to build a retrieval-based QA system:
```python
from langchain.chains import RetrievalQA

# gpt-4o is a chat model, so use ChatOpenAI rather than the completions-based OpenAI class
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
```
Explanation
- `RetrievalQA`: Retrieves relevant chunks from the vector database (`db`) and passes them to the language model (`llm`) for generating an answer.
- `gpt-4o`: OpenAI's chat model, used here as the underlying language model for generating responses.
Example Query
Test the QA chain with a query:
```python
response = climate_qa_chain.invoke({"query": "Is sea level rise avoidable? When will it stop?"})
# RetrievalQA returns a dict; the generated answer is under the "result" key
print(response["result"])
```
Expected Output:
- "Sea level rise is largely unavoidable due to current warming levels. It will continue for centuries, though mitigation efforts can slow the rate."
Evaluating the Model with Giskard
Wrapping the Model
To enable Giskard’s evaluation and testing functionalities, wrap the QA chain:
```python
!pip install backports.strenum griffe==0.48.0

import giskard
import pandas as pd

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    Giskard expects one generated string per input row, so we extract the
    "result" field from each RetrievalQA response.
    """
    return [climate_qa_chain.invoke({"query": question})["result"] for question in df["question"]]

# Wrap with Giskard
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
```
Key Features
- `model_predict`: Converts the QA chain into a callable function.
- Giskard wrapper: Adds metadata like model type and description for effective testing.
Testing with Example Data
Create a dataset of example queries for testing:
```python
examples = [
    "According to the IPCC report, what are key risks in Europe?",
    "Is sea level rise avoidable? When will it stop?",
]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": examples}), target=None)

predictions = giskard_model.predict(giskard_dataset)
print(predictions.prediction)
```
Expected Output
The model generates context-aware answers to each query based on the IPCC report.
Scanning for Vulnerabilities
Run Giskard’s scan to detect issues:
```python
# Focused scan: only run the hallucination detectors for quick feedback
report = giskard.scan(giskard_model, giskard_dataset, only="hallucination")

# Full scan across all detectors
full_report = giskard.scan(giskard_model, giskard_dataset)
```
Use Case
This scan identifies vulnerabilities such as:
- Hallucinations (unsupported claims).
- Bias in responses.
- Security weaknesses.
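Giskard's LLM-assisted detectors call an OpenAI model under the hood to generate and judge adversarial probes. Recent Giskard releases expose a configuration hook to pin that judge model explicitly; this is a hedged sketch, so verify the exact API against the documentation for your installed version:

```python
import giskard

# Pin the model Giskard uses internally for its LLM-assisted detectors
# (available in recent Giskard releases; check your version's docs)
giskard.llm.set_llm_model("gpt-4o")
```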
Saving the Report
```python
display(full_report)
full_report.to_html("scan_report.html")
```
Save the report for further analysis or sharing.
Generating Automated Test Suites
Giskard can generate test suites from the scan:
```python
test_suite = full_report.generate_test_suite(name="Test suite generated by scan")
test_suite.run()
```
Automated test suites enable continuous validation of model quality during development.
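Because `test_suite.run()` returns a result object, the suite can gate a CI pipeline. As a minimal sketch (assuming the result exposes a boolean `passed` attribute, as in current Giskard releases):

```python
# Fail the build when any generated test fails
results = test_suite.run()
if not results.passed:
    raise SystemExit("Model quality checks failed - see the suite output above.")
```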
Conclusion
Summary
This guide demonstrated how to:
- Set up Giskard and related libraries for AI evaluation.
- Build a robust QA system using LangChain and OpenAI models.
- Evaluate and enhance the system’s reliability using Giskard’s scanning and testing tools.
Next Steps
- Experiment with different datasets and document types.
- Customize test suites to align with specific domain requirements.
- Explore Giskard’s advanced features for large-scale model validation.
Resources
- Giskard Documentation
- LangChain Documentation
- FAISS Documentation
- OpenAI API Documentation
- Giskard Build Fast With AI NoteBook
---------------------------------
Stay Updated: Follow Build Fast with AI pages for all the latest AI updates and resources.
Experts predict 2025 will be the defining year for Gen AI implementation. Want to be ahead of the curve?
Join Build Fast with AI’s Gen AI Launch Pad 2025 - your accelerated path to mastering AI tools and building revolutionary applications.