
ExtractThinker vs. Manual Work: Why AI Wins Every Time!

January 30, 2025


Introduction

In the age of AI-driven automation, extracting, processing, and understanding data from diverse document formats is crucial. ExtractThinker is an open-source Document Intelligence framework that integrates seamlessly with Large Language Models (LLMs) to streamline document processing. Whether you need to extract structured information from PDFs, images, or spreadsheets, ExtractThinker offers a powerful ORM-style interface, advanced data extraction capabilities, and flexible classification tools.

In this guide, we will walk you through the installation, setup, and use of ExtractThinker, providing a deep dive into its features and real-world applications. By the end, you'll have a clear understanding of how to automate document intelligence tasks with LLMs.

Key Features

ExtractThinker is designed to handle large-scale document processing efficiently. Some of its standout features include:

  • Multi-format Document Support: Extract data from PDFs, images, and spreadsheets.
  • Advanced Data Extraction: Define precise extraction contracts using Pydantic models.
  • Asynchronous Processing: Process large documents efficiently.
  • Flexible Document Loaders: Supports OCR tools like Tesseract, Azure Form Recognizer, and AWS Textract.
  • Seamless LLM Integration: Works with OpenAI, Anthropic, Cohere, and more.
  • ORM-style Interface: Intuitive, developer-friendly API for document processing.
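To make the asynchronous-processing point concrete, here is a minimal sketch of running several blocking extraction calls concurrently with the standard library. It does not use ExtractThinker's own async API: `extract_blocking` is a hypothetical stand-in that sleeps instead of calling an LLM, and `extract_many` and `max_concurrent` are illustrative names.

```python
import asyncio
import time


def extract_blocking(path: str) -> dict:
    """Hypothetical stand-in for a blocking extraction call; sleeps instead of calling an LLM."""
    time.sleep(0.1)
    return {"path": path, "status": "done"}


async def extract_many(paths: list[str], max_concurrent: int = 3) -> list[dict]:
    """Run blocking extractions in worker threads, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> dict:
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs
            return await asyncio.to_thread(extract_blocking, path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(worker(p) for p in paths))


if __name__ == "__main__":
    results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"]))
    print(results)
```

In a notebook, where an event loop is already running, call `await extract_many([...])` instead of `asyncio.run(...)`. The semaphore caps how many requests are in flight, which helps stay under provider rate limits.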

Now, let’s get started with installing and setting up ExtractThinker.

Installation and Setup

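To begin, install ExtractThinker along with pypdf, which the PDF loader used below relies on. The leading ! runs the command from a notebook cell; omit it in a terminal.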

!pip install extract-thinker pypdf

To use ExtractThinker with OpenAI’s GPT models, expose your API key as an environment variable. In Google Colab, store the key as a secret named OPENAI_API_KEY and load it like this:

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

This keeps the key out of the notebook source. The LLM calls made during document processing read it from the environment.
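Outside Colab, the google.colab module is not available. A minimal sketch for a local script, assuming the key was already exported in your shell: `require_env` is a hypothetical helper, not part of ExtractThinker, that fails early with a readable message instead of letting the first LLM call fail.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; not part of ExtractThinker.
    """
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it in your shell before running this script."
        )
    return value


# Usage (uncomment once the variable is exported):
# api_key = require_env("OPENAI_API_KEY")
```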

Document Loading and Extraction

Step 1: Create a Document Loader

ExtractThinker provides various document loaders. Here, we’ll use DocumentLoaderPyPdf to handle PDF files.

from extract_thinker import DocumentLoaderPyPdf

document_loader = DocumentLoaderPyPdf()

Step 2: Initialize the Extractor

An extractor ties the document loader to the LLM: the loader reads the file's content, and the LLM turns that content into structured fields.

from extract_thinker import Extractor

extractor = Extractor()
extractor.load_document_loader(document_loader)
extractor.load_llm("gpt-4o-mini")

Step 3: Define a Data Extraction Contract

ExtractThinker uses Pydantic-based contracts to define the structure of extracted data. Let’s create a contract for processing invoices:

from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

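This contract ensures that only the specified fields are extracted from the document. Both fields are declared as plain strings, so the date is returned as text rather than as a parsed date.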
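To illustrate the idea of a contract as a schema, here is a rough standard-library sketch. It swaps Pydantic for a dataclass and does not reflect how ExtractThinker works internally: `InvoiceFields` and `keep_declared_fields` are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass, fields


@dataclass
class InvoiceFields:
    """Dataclass stand-in for InvoiceContract (the real contract is a Pydantic model)."""
    invoice_number: str
    invoice_date: str


def keep_declared_fields(schema, raw: dict):
    """Build `schema` from `raw`, dropping keys the schema does not declare."""
    declared = {f.name for f in fields(schema)}
    missing = declared - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return schema(**{k: v for k, v in raw.items() if k in declared})


# Made-up input with one extra key that the schema does not declare
raw = {"invoice_number": "INV-1", "invoice_date": "2024-01-15", "total": "99.00"}
print(keep_declared_fields(InvoiceFields, raw))
```

The extra "total" key is dropped and a missing key raises an error, which is the same guarantee a contract gives downstream code: it can rely on exactly the declared fields being present.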

Step 4: Download and Process a PDF

For demonstration, let’s download a sample invoice PDF and extract its data:

!wget -O invoice.pdf "https://github.com/enoch3712/ExtractThinker/raw/main/examples/invoice.pdf"

Now, extract data based on the contract:

result = extractor.extract("invoice.pdf", InvoiceContract)
print(result)

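Expected Output

The fields are shown here as JSON for readability; calling print(result) on a Pydantic object displays them as field='value' pairs.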

{
    "invoice_number": "INV-2024-001",
    "invoice_date": "2024-01-15"
}

The extracted data provides structured information that can be used in downstream applications like financial analysis or record-keeping.
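As a sketch of such a downstream step, the snippet below parses the date string and serializes the record. It assumes the date comes back in YYYY-MM-DD form, as in the expected output above; an LLM may return other formats, so real code should handle parse failures. `normalize_invoice` is a hypothetical helper, and the dict stands in for the extracted result (if the result is a Pydantic v2 model, `result.model_dump()` should produce a comparable dict).

```python
import json
from datetime import datetime


def normalize_invoice(fields: dict) -> dict:
    """Parse the date string and return a record ready for storage or analysis."""
    parsed = datetime.strptime(fields["invoice_date"], "%Y-%m-%d").date()
    return {
        "invoice_number": fields["invoice_number"].strip(),
        "invoice_date": parsed,
        "year": parsed.year,
    }


# Stand-in for the extracted result, using the values from the expected output
extracted = {"invoice_number": "INV-2024-001", "invoice_date": "2024-01-15"}
record = normalize_invoice(extracted)

# default=str turns the date object back into text for JSON
print(json.dumps(record, default=str))
```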

Interactive File Upload and Processing

In Jupyter Notebook or Google Colab, you can pair ExtractThinker with ipywidgets so that users can upload a PDF or type a file path. The example below reuses the extractor and InvoiceContract defined earlier.

import ipywidgets as widgets
from IPython.display import display
import tempfile

file_upload = widgets.FileUpload(accept='.pdf', description='Upload PDF')

file_path_input = widgets.Text(
    placeholder="Enter the downloaded PDF file path",
    description="File Path:",
    layout=widgets.Layout(width='80%')
)

output = widgets.Output()

def process_pdf(file_path):
    with output:
        print(f"Processing File: {file_path}")
        result = extractor.extract(file_path, InvoiceContract)
        print(result)

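def on_file_uploaded(change):
    files = change['new']
    if files:
        # ipywidgets 7 passes a dict keyed by filename; ipywidgets 8 passes a tuple
        uploaded_file = next(iter(files.values())) if isinstance(files, dict) else files[0]
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            temp_file.write(uploaded_file['content'])
            temp_file_path = temp_file.name
        # Process only after the file is closed, so its contents are flushed to disk
        process_pdf(temp_file_path)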

def on_path_submit(change):
    if file_path_input.value:
        process_pdf(file_path_input.value)

file_upload.observe(on_file_uploaded, names='value')
file_path_input.on_submit(on_path_submit)

display(file_upload, file_path_input, output)

This code allows users to either upload a PDF file or provide a file path manually for processing.
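A manually typed path is easy to get wrong, so it can help to check it before calling the extractor. A minimal sketch, where `validate_pdf_path` is a hypothetical helper that is not part of ExtractThinker or ipywidgets:

```python
from pathlib import Path


def validate_pdf_path(raw_path: str) -> Path:
    """Return a resolved Path if it points to an existing .pdf file, else raise ValueError."""
    path = Path(raw_path.strip()).expanduser()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {path}")
    if not path.is_file():
        raise ValueError(f"File not found: {path}")
    return path.resolve()
```

Inside `on_path_submit`, you could wrap the call in try/except ValueError and print the message to the output widget instead of passing a bad path to the LLM pipeline.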

Document Classification

ExtractThinker also supports document classification, allowing automated categorization of documents.

Step 1: Define Contracts

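from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification,
    DocumentLoaderPyPdf, Contract
)

# Load OPENAI_API_KEY from a local .env file (requires python-dotenv);
# skip this if the key is already set in the environment
load_dotenv()

# InvoiceContract is reused from the extraction example above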

class DriverLicenseContract(Contract):
    name: str
    license_number: str

Step 2: Initialize the Extractor

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

Step 3: Classify the Document

classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

result = extractor.classify("invoice.pdf", classifications)

print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")

Expected Output

Document classified as: Invoice
Confidence level: 10
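A classification result is most useful when it drives what happens next. The sketch below routes a document by its label and falls back to manual review when confidence is low. It assumes the result exposes `name` and an integer `confidence`, as in the output above; `ClassificationOutcome` is a stand-in for that object, and the threshold of 8 and the queue names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassificationOutcome:
    """Stand-in for the object returned by extractor.classify()."""
    name: str
    confidence: int


def route(outcome: ClassificationOutcome, min_confidence: int = 8) -> str:
    """Pick a processing queue, sending low-confidence or unknown results to manual review."""
    if outcome.confidence < min_confidence:
        return "manual_review"
    # Hypothetical queue names keyed by the classification labels used above
    queues = {"Invoice": "accounts_payable", "Driver License": "identity_checks"}
    return queues.get(outcome.name, "manual_review")


print(route(ClassificationOutcome(name="Invoice", confidence=10)))
```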

Conclusion

ExtractThinker offers a flexible framework for document processing that connects document loaders to LLMs for data extraction and classification. By defining structured contracts, choosing a loader suited to the input (including OCR-backed loaders for scanned documents), and classifying documents before extraction, businesses can automate document intelligence workflows.

Resources

  • ExtractThinker GitHub Repository
  • Pydantic Documentation
  • OpenAI API Documentation
  • ExtractThinker Build Fast with AI Notebook
