Instructor: The Most Popular Library for Simple Structured Outputs

Introduction
As AI models like GPT become more powerful and flexible, developers are often faced with a challenge: how do we get structured outputs from large language models (LLMs)? Enter Instructor, a library designed to simplify structured data extraction from LLMs. In this blog post, we'll explore what makes Instructor so effective, break down the code, and understand how you can integrate it with models like OpenAI's GPT and Cohere's models.
Why Use Instructor?
Instructor makes it easy to prompt LLMs for structured outputs, such as JSON data. Instead of receiving unstructured text, you can request LLMs to provide responses in the format you need. This is especially useful for:
- Form Data Extraction: Automating extraction of specific fields from documents.
- APIs & Automation: Structuring data for APIs or downstream processing.
- Enterprise Use-Cases: Tasks that require predictable and structured results.
- Data Pipelines: When you need clean, structured data for analytics or reporting.
- Chatbots and Assistants: Ensuring responses from AI assistants follow a predictable format.
Instructor abstracts away complexity, enabling you to build robust applications faster. By specifying a schema for the output, you ensure your AI delivers exactly what you need.
Installation
First, let's install the necessary libraries. The notebook starts with a simple installation step:
!pip install instructor openai==1.57.4 cohere --quiet
- Instructor: The main library for structured outputs.
- OpenAI: For accessing OpenAI models like GPT-3.5 and GPT-4.
- Cohere: An alternative to OpenAI, providing different LLM capabilities.
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 249.9/249.9 kB 13.6 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 77.9 MB/s eta 0:00:00
This installs all necessary packages quietly (without verbose output).
Troubleshooting Installation Issues
- Network Issues: If the installation is slow or fails, check your internet connection.
- Version Conflicts: If you have older versions of libraries installed, update them using pip install --upgrade (see the example below).
- Environment Issues: Ensure you're working in a clean virtual environment or Colab instance to avoid conflicts.
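For instance, the packages used in this post can be refreshed like this (note that upgrading may move you past the openai==1.57.4 pin used above):
!pip install --upgrade instructor openai cohere --quiet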
Setting Up API Keys
Next, you need API keys for OpenAI and Cohere. The code fetches these from Google Colab's userdata storage:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['CO_API_KEY'] = userdata.get('CO_API_KEY')
How to Obtain API Keys
1. OpenAI API Key:
- Sign up at OpenAI.
- Go to your account settings and generate a new API key.
2. Cohere API Key:
- Sign up at Cohere.
- Navigate to the API section and generate a new API key.
Security Tips
- Never share your API keys publicly or commit them to repositories.
- Use environment variables or secure storage options to manage keys.
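Outside Colab, a minimal sketch using Python's standard-library getpass module keeps keys out of your source code and shell history (the variable names match those above):
import os
from getpass import getpass

# Prompt interactively for each key; nothing is hard-coded or logged
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')
os.environ['CO_API_KEY'] = getpass('Cohere API key: ')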
Importing Libraries
Now, let's import the required libraries:
import instructor
from openai import OpenAI
from pydantic import BaseModel
- Instructor: The core library for handling structured outputs.
- OpenAI: For interfacing with OpenAI's models.
- Pydantic: For defining structured data models.
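The examples later in this post also assume a patched client. In recent Instructor releases this is created with instructor.from_openai (older versions used instructor.patch), so treat the exact call as version-dependent:
# Wrap the OpenAI client; the returned client's chat.completions.create
# gains a response_model parameter for structured outputs
client = instructor.from_openai(OpenAI())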
What is Pydantic?
Pydantic is a powerful data validation and parsing library in Python. It allows you to define schemas (structured models) for your data using Python classes. These schemas ensure that data conforms to the expected format and type, providing a reliable way to validate incoming data and prevent errors. Pydantic is particularly useful when you need to ensure consistency and correctness of data in applications.
Key Features of Pydantic
- Type Enforcement: Ensures that data matches specified types, such as str, float, int, or custom types.
- Validation: Automatically validates data against the defined schema and raises clear error messages if the data is incorrect.
- Serialization/Deserialization: Converts data between different formats (e.g., JSON to Python objects and vice versa).
- Nested Models: Supports defining complex schemas with nested data structures.
- Error Handling: Provides detailed error messages when validation fails, making debugging easier.
- Automatic Data Parsing: Automatically parses input data, transforming it to the correct types.
Example of Pydantic Model
Here's an example of a simple pydantic model:
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

# Creating a User instance
user = User(name="Alice", age=30, email="alice@example.com")
print(user)
Output:
name='Alice' age=30 email='alice@example.com'
If you provide incorrect data types, Pydantic will raise a validation error:
try:
    user = User(name="Alice", age="thirty", email="alice@example.com")
except Exception as e:
    print(e)
Output:
age
  value is not a valid integer (type=type_error.integer)
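To illustrate the nested-model and serialization features listed above, here is a small sketch (the Address and Profile models are illustrative, not from the notebook):
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class Profile(BaseModel):
    name: str
    age: int
    address: Address  # nested model

# Input dicts are parsed into the nested structure automatically
profile = Profile(name="Alice", age=30, address={"city": "Paris", "country": "France"})

# Serialize back to JSON (on Pydantic v1, use profile.json() instead)
print(profile.model_dump_json())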
Why Use Pydantic with Instructor?
When combined with Instructor, Pydantic helps define the structure of the data you expect from an LLM. This means you can:
- Enforce Data Integrity: Ensure the LLM’s response conforms to your schema.
- Reduce Errors: Identify and handle invalid outputs gracefully.
- Streamline Processing: Easily integrate structured outputs into your workflows, APIs, and data pipelines.
Instructor uses Pydantic models to guide the LLM in generating consistent, structured outputs, making your applications more reliable and easier to maintain.
Creating a Structured Data Model
Here's an example of how to define a structured output using pydantic and Instructor:
class WeatherResponse(BaseModel):
    location: str
    temperature: float
    condition: str
In this example:
- WeatherResponse: A pydantic model specifying the desired fields:
  - location: Name of the location (string).
  - temperature: The temperature in degrees (float).
  - condition: The weather condition (string).
This model tells the LLM to output responses matching this structure.
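Putting it together, here is a hedged sketch of a successful call using the patched client created after the imports above (the prompt and model name are illustrative):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,  # Instructor validates the reply against this schema
)
print(response)
# Expected output along the lines of:
# location='Paris' temperature=18.0 condition='cloudy'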
Why Use Structured Models?
- Consistency: Ensures the LLM output follows a predictable structure.
- Error Reduction: Reduces the chances of unexpected or unusable data.
- Easier Parsing: Simplifies downstream processing and integration with APIs or databases.
Error Handling
Instructor can gracefully handle errors when the model output doesn't match the expected structure. If the LLM returns an output that doesn't align with the defined pydantic model, Instructor raises a validation error.
Example of Error Handling
try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Give me a list of temperatures."}],
        response_model=WeatherResponse
    )
    print(response)
except Exception as e:
    print("Error:", e)
Output:
Error: 1 validation error for WeatherResponse
response -> location
  field required (type=value_error.missing)
This helps ensure your application can handle unexpected outputs gracefully.
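Instructor can also re-ask the model automatically when validation fails, via its max_retries parameter. A minimal sketch (the retry count here is arbitrary):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,
    max_retries=2,  # feed validation errors back to the model up to 2 times
)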
Conclusion
The Instructor library is a powerful tool for extracting structured data from large language models like OpenAI's GPT and Cohere's models. By combining the flexibility of LLMs with the precision of pydantic schemas, Instructor allows you to build applications that require consistent, structured outputs with ease.
Key Takeaways:
- Ease of Use: Instructor simplifies prompting for structured outputs.
- Consistency: Ensure predictable results by defining pydantic schemas.
- Flexibility: Works with both OpenAI and Cohere models (see the sketch below).
- Robustness: Built-in error handling for invalid outputs.
Whether you're building chatbots, automating data pipelines, or working on enterprise AI solutions, Instructor can help streamline your development process.
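As a rough sketch of the Cohere side: recent Instructor releases ship instructor.from_cohere, but the exact signature and default model vary by version, so treat this as an assumption to verify against the Instructor docs:
import cohere
import instructor

# Wrap the Cohere client just like the OpenAI one (assumes CO_API_KEY is set)
co_client = instructor.from_cohere(cohere.Client())

response = co_client.chat.completions.create(
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,
)
print(response)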
Resources
- Instructor GitHub Repository: Instructor on GitHub
- OpenAI API Documentation: OpenAI Docs
- Cohere API Documentation: Cohere Docs
- Pydantic Documentation: Pydantic Docs
- Instructor Build Fast with AI: Notebook