Instructor: The Most Popular Library for Simple Structured Outputs

Introduction
As AI models like GPT become more powerful and flexible, developers are often faced with a challenge: how do we get structured outputs from large language models (LLMs)? Enter Instructor, a library designed to simplify structured data extraction from LLMs. In this blog post, we'll explore what makes Instructor so effective, break down the code, and understand how you can integrate it with models like OpenAI's GPT and Cohere's models.
Why Use Instructor?
Instructor makes it easy to prompt LLMs for structured outputs, such as JSON data. Instead of receiving unstructured text, you can request LLMs to provide responses in the format you need. This is especially useful for:
- Form Data Extraction: Automating extraction of specific fields from documents.
- APIs & Automation: Structuring data for APIs or downstream processing.
- Enterprise Use-Cases: Tasks that require predictable and structured results.
- Data Pipelines: When you need clean, structured data for analytics or reporting.
- Chatbots and Assistants: Ensuring responses from AI assistants follow a predictable format.
Instructor abstracts away complexity, enabling you to build robust applications faster. By specifying a schema for the output, you ensure your AI delivers exactly what you need.
Installation
First, let's install the necessary libraries. The notebook starts with a simple installation step:
!pip install instructor openai==1.57.4 cohere --quiet
- Instructor: The main library for structured outputs.
- OpenAI: For accessing OpenAI models like GPT-3.5 and GPT-4.
- Cohere: An alternative to OpenAI, providing different LLM capabilities.
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 249.9/249.9 kB 13.6 MB/s eta 0:00:00 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 77.9 MB/s eta 0:00:00
This installs all necessary packages quietly (without verbose output).
Troubleshooting Installation Issues
- Network Issues: If the installation is slow or fails, check your internet connection.
- Version Conflicts: If you have older versions of libraries installed, update them using pip install --upgrade (see the example below).
- Environment Issues: Ensure you're working in a clean virtual environment or Colab instance to avoid conflicts.
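For instance, the packages used in this post can be refreshed like this (note that upgrading may move you past the openai==1.57.4 pin used above):
!pip install --upgrade instructor openai cohere --quiet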
Setting Up API Keys
Next, you need API keys for OpenAI and Cohere. The code fetches these from Google Colab's userdata storage:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['CO_API_KEY'] = userdata.get('CO_API_KEY')
How to Obtain API Keys
1. OpenAI API Key:
- Sign up at OpenAI.
- Go to your account settings and generate a new API key.
2. Cohere API Key:
- Sign up at Cohere.
- Navigate to the API section and generate a new API key.
Security Tips
- Never share your API keys publicly or commit them to repositories.
- Use environment variables or secure storage options to manage keys.
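Outside Colab, a minimal sketch using Python's standard-library getpass module keeps keys out of your source code and shell history (the variable names match those above):
import os
from getpass import getpass

# Prompt interactively for each key; nothing is hard-coded or logged
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')
os.environ['CO_API_KEY'] = getpass('Cohere API key: ')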
Importing Libraries
Now, let's import the required libraries:
import instructor
from openai import OpenAI
from pydantic import BaseModel
- Instructor: The core library for handling structured outputs.
- OpenAI: For interfacing with OpenAI's models.
- Pydantic: For defining structured data models.
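The examples later in this post also assume a patched client. In recent Instructor releases this is created with instructor.from_openai (older versions used instructor.patch), so treat the exact call as version-dependent:
# Wrap the OpenAI client; the returned client's chat.completions.create
# gains a response_model parameter for structured outputs
client = instructor.from_openai(OpenAI())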
What is Pydantic?
Pydantic is a powerful data validation and parsing library in Python. It allows you to define schemas (structured models) for your data using Python classes. These schemas ensure that data conforms to the expected format and type, providing a reliable way to validate incoming data and prevent errors. Pydantic is particularly useful when you need to ensure consistency and correctness of data in applications.
Key Features of Pydantic
- Type Enforcement: Ensures that data matches specified types, such as str, float, int, or custom types.
- Validation: Automatically validates data against the defined schema and raises clear error messages if the data is incorrect.
- Serialization/Deserialization: Converts data between different formats (e.g., JSON to Python objects and vice versa).
- Nested Models: Supports defining complex schemas with nested data structures.
- Error Handling: Provides detailed error messages when validation fails, making debugging easier.
- Automatic Data Parsing: Automatically parses input data, transforming it to the correct types.
Example of Pydantic Model
Here's an example of a simple pydantic model:
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

# Creating a User instance
user = User(name="Alice", age=30, email="alice@example.com")
print(user)
Output:
name='Alice' age=30 email='alice@example.com'
If you provide incorrect data types, Pydantic will raise a validation error:
try:
    user = User(name="Alice", age="thirty", email="alice@example.com")
except Exception as e:
    print(e)
Output:
age
  value is not a valid integer (type=type_error.integer)
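To illustrate the nested-model and serialization features listed above, here is a small sketch (the Address and Profile models are illustrative, not from the notebook):
from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class Profile(BaseModel):
    name: str
    age: int
    address: Address  # nested model

# Input dicts are parsed into the nested structure automatically
profile = Profile(name="Alice", age=30, address={"city": "Paris", "country": "France"})

# Serialize back to JSON (on Pydantic v1, use profile.json() instead)
print(profile.model_dump_json())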
Why Use Pydantic with Instructor?
When combined with Instructor, Pydantic helps define the structure of the data you expect from an LLM. This means you can:
- Enforce Data Integrity: Ensure the LLM’s response conforms to your schema.
- Reduce Errors: Identify and handle invalid outputs gracefully.
- Streamline Processing: Easily integrate structured outputs into your workflows, APIs, and data pipelines.
Instructor uses Pydantic models to guide the LLM in generating consistent, structured outputs, making your applications more reliable and easier to maintain.
Creating a Structured Data Model
Here's an example of how to define a structured output using pydantic and Instructor:
class WeatherResponse(BaseModel):
    location: str
    temperature: float
    condition: str
In this example:
- WeatherResponse: A pydantic model specifying the desired fields:
  - location: Name of the location (string).
  - temperature: The temperature in degrees (float).
  - condition: The weather condition (string).
This model tells the LLM to output responses matching this structure.
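Putting it together, here is a hedged sketch of a successful call using the patched client created after the imports above (the prompt and model name are illustrative):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,  # Instructor validates the reply against this schema
)
print(response)
# Expected output along the lines of:
# location='Paris' temperature=18.0 condition='cloudy'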
Why Use Structured Models?
- Consistency: Ensures the LLM output follows a predictable structure.
- Error Reduction: Reduces the chances of unexpected or unusable data.
- Easier Parsing: Simplifies downstream processing and integration with APIs or databases.
Error Handling
Instructor can gracefully handle errors when the model output doesn't match the expected structure. If the LLM returns an output that doesn't align with the defined pydantic model, Instructor raises a validation error.
Example of Error Handling
try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Give me a list of temperatures."}],
        response_model=WeatherResponse
    )
    print(response)
except Exception as e:
    print("Error:", e)
Output:
Error: 1 validation error for WeatherResponse
response -> location
  field required (type=value_error.missing)
This helps ensure your application can handle unexpected outputs gracefully.
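Instructor can also re-ask the model automatically when validation fails, via its max_retries parameter. A minimal sketch (the retry count here is arbitrary):
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,
    max_retries=2,  # feed validation errors back to the model up to 2 times
)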
Conclusion
The Instructor library is a powerful tool for extracting structured data from large language models like OpenAI's GPT and Cohere's models. By combining the flexibility of LLMs with the precision of pydantic schemas, Instructor allows you to build applications that require consistent, structured outputs with ease.
Key Takeaways:
- Ease of Use: Instructor simplifies prompting for structured outputs.
- Consistency: Ensure predictable results by defining pydantic schemas.
- Flexibility: Works with both OpenAI and Cohere models (see the sketch below).
- Robustness: Built-in error handling for invalid outputs.
Whether you're building chatbots, automating data pipelines, or working on enterprise AI solutions, Instructor can help streamline your development process.
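As a rough sketch of the Cohere side: recent Instructor releases ship instructor.from_cohere, but the exact signature and default model vary by version, so treat this as an assumption to verify against the Instructor docs:
import cohere
import instructor

# Wrap the Cohere client just like the OpenAI one (assumes CO_API_KEY is set)
co_client = instructor.from_cohere(cohere.Client())

response = co_client.chat.completions.create(
    messages=[{"role": "user", "content": "It's 18 degrees and cloudy in Paris right now."}],
    response_model=WeatherResponse,
)
print(response)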
Resources
- Instructor GitHub Repository: Instructor on GitHub
- OpenAI API Documentation: OpenAI Docs
- Cohere API Documentation: Cohere Docs
- Pydantic Documentation: Pydantic Docs
- Instructor Build Fast with AI: Notebook