ExtractThinker vs. Manual Work: Why AI Wins Every Time!

Introduction
In the age of AI-driven automation, extracting, processing, and understanding data from diverse document formats is crucial. ExtractThinker is an open-source Document Intelligence framework that integrates seamlessly with Large Language Models (LLMs) to streamline document processing. Whether you need to extract structured information from PDFs, images, or spreadsheets, ExtractThinker offers a powerful ORM-style interface, advanced data extraction capabilities, and flexible classification tools.
In this guide, we will walk you through the installation, setup, and use of ExtractThinker, providing a deep dive into its features and real-world applications. By the end, you'll have a clear understanding of how to automate document intelligence tasks with LLMs.
Key Features
ExtractThinker is designed to handle large-scale document processing efficiently. Some of its standout features include:
- Multi-format Document Support: Extract data from PDFs, images, and spreadsheets.
- Advanced Data Extraction: Define precise extraction contracts using Pydantic models.
- Asynchronous Processing: Process large documents efficiently.
- Flexible Document Loaders: Supports OCR tools like Tesseract, Azure Form Recognizer, and AWS Textract.
- Seamless LLM Integration: Works with OpenAI, Anthropic, Cohere, and more.
- ORM-style Interface: Intuitive, developer-friendly API for document processing.
Now, let’s get started with installing and setting up ExtractThinker.
Installation and Setup
To begin, install ExtractThinker along with necessary dependencies:
!pip install extract-thinker pypdf
To use ExtractThinker with OpenAI’s GPT models, set up your API key in your environment:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
Reading the key from Colab's secrets store keeps it out of the notebook source while still making it available to the LLM client.
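Outside Colab, the same key can come from an ordinary environment variable. The sketch below shows one way to fail early when it is missing. `require_api_key` is a hypothetical helper written for this guide, not part of ExtractThinker, and the demo value is a placeholder.

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY", env=None) -> str:
    """Return the named key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key

# An explicit mapping is passed here so the snippet runs without a real key.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(require_api_key(env=demo_env))
```

Calling `require_api_key()` with no arguments checks the real process environment instead.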
Document Loading and Extraction
Step 1: Create a Document Loader
ExtractThinker provides various document loaders. Here, we’ll use DocumentLoaderPyPdf to handle PDF files.
from extract_thinker import DocumentLoaderPyPdf

document_loader = DocumentLoaderPyPdf()
Step 2: Initialize the Extractor
An extractor acts as a bridge between the document loader and the LLM, facilitating data extraction.
from extract_thinker import Extractor

extractor = Extractor()
extractor.load_document_loader(document_loader)
extractor.load_llm("gpt-4o-mini")
Step 3: Define a Data Extraction Contract
ExtractThinker uses Pydantic-based contracts to define the structure of extracted data. Let’s create a contract for processing invoices:
from extract_thinker import Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
This contract ensures that only the specified fields are extracted from the document.
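To make the idea of a contract concrete, here is a minimal stdlib-only sketch of what checking a response against declared fields might look like. It is an illustration, not how ExtractThinker works internally: the real Contract is Pydantic-based and handles validation itself. `InvoiceFields` and `check_against_contract` are hypothetical names, and the JSON string stands in for an LLM response.

```python
import json
from typing import get_type_hints

class InvoiceFields:
    """Stand-in for the contract above: field names and expected types."""
    invoice_number: str
    invoice_date: str

def check_against_contract(raw_json: str, contract: type) -> dict:
    """Keep only the contract's fields and verify each has the declared type."""
    data = json.loads(raw_json)
    checked = {}
    for field, expected in get_type_hints(contract).items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")
        checked[field] = data[field]
    return checked

# The extra "note" key is not declared in the contract, so it is dropped.
raw = '{"invoice_number": "INV-2024-001", "invoice_date": "2024-01-15", "note": "ignored"}'
fields = check_against_contract(raw, InvoiceFields)
print(fields)
```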
Step 4: Download and Process a PDF
For demonstration, let’s download a sample invoice PDF and extract its data:
!wget -O invoice.pdf "https://github.com/enoch3712/ExtractThinker/raw/main/examples/invoice.pdf"
Now, extract data based on the contract:
result = extractor.extract("invoice.pdf", InvoiceContract)
print(result)
Expected Output
{
  "invoice_number": "INV-2024-001",
  "invoice_date": "2024-01-15"
}
The extracted data provides structured information that can be used in downstream applications like financial analysis or record-keeping.
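As one example of downstream use, the sketch below turns an extracted record into a CSV line for record-keeping. It assumes the date arrives in ISO format (YYYY-MM-DD), as in the sample output; real invoices may need a different parser. `invoice_to_csv_row` is a hypothetical helper, and the record is a plain dict for illustration.

```python
import csv
import io
from datetime import date

def invoice_to_csv_row(record: dict) -> str:
    """Validate the ISO date and render the record as one CSV line."""
    parsed = date.fromisoformat(record["invoice_date"])  # raises ValueError on bad dates
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Columns: invoice number, normalized date, year (handy for grouping).
    writer.writerow([record["invoice_number"], parsed.isoformat(), parsed.year])
    return buffer.getvalue()

record = {"invoice_number": "INV-2024-001", "invoice_date": "2024-01-15"}
row = invoice_to_csv_row(record)
print(row, end="")
```

If the extraction result is a Pydantic v2 model, calling `model_dump()` on it should yield a dict of this shape.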
Interactive File Upload and Processing
To make the workflow interactive, you can pair ExtractThinker with ipywidgets so that users can upload files through a widget in Jupyter Notebook or Google Colab.
import tempfile

import ipywidgets as widgets
from IPython.display import display

# Two input options: an upload button and a free-text path field.
file_upload = widgets.FileUpload(accept='.pdf', description='Upload PDF')
file_path_input = widgets.Text(
    placeholder="Enter the downloaded PDF file path",
    description="File Path:",
    layout=widgets.Layout(width='80%')
)
output = widgets.Output()

def process_pdf(file_path):
    # Run the extraction and show the result in the output area.
    with output:
        print(f"Processing File: {file_path}")
        result = extractor.extract(file_path, InvoiceContract)
        print(result)

def on_file_uploaded(change):
    if change['new']:
        # ipywidgets 7 exposes uploads as a dict, ipywidgets 8 as a tuple.
        new_value = change['new']
        files = list(new_value.values()) if isinstance(new_value, dict) else list(new_value)
        uploaded_file = files[0]
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            temp_file.write(uploaded_file['content'])
            temp_file_path = temp_file.name
        # Process only after the temp file is closed, so all bytes are on disk.
        process_pdf(temp_file_path)

def on_path_submit(widget):
    # Called when the user presses Enter in the text field.
    if file_path_input.value:
        process_pdf(file_path_input.value)

file_upload.observe(on_file_uploaded, names='value')
file_path_input.on_submit(on_path_submit)
display(file_upload, file_path_input, output)
This code allows users to either upload a PDF file or provide a file path manually for processing.
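The temp-file step can be isolated and hardened a little. The sketch below writes uploaded bytes to a temporary .pdf file, but first checks that the content starts with the `%PDF-` signature, so that non-PDF uploads are rejected before they reach the extractor. `save_pdf_bytes` is a hypothetical helper, and the demo bytes are not a real PDF.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"

def save_pdf_bytes(content) -> str:
    """Write uploaded bytes to a temporary .pdf file and return its path."""
    # bytes(...) also accepts a memoryview, which some widget versions supply.
    if bytes(content[:5]) != PDF_MAGIC:
        raise ValueError("upload does not look like a PDF")
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(content)
        return tmp.name

path = save_pdf_bytes(b"%PDF-1.4\n% demo bytes only\n")
print(path.endswith(".pdf"))
os.remove(path)  # delete=False means cleanup is the caller's job
```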
Document Classification
ExtractThinker also supports document classification, allowing automated categorization of documents.
Step 1: Define Contracts
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Classification,
    Process,
    ClassificationStrategy,
    DocumentLoaderPyPdf,
    Contract,
)

load_dotenv()

class DriverLicenseContract(Contract):
    name: str
    license_number: str
Step 2: Initialize the Extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")
Step 3: Classify the Document
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

result = extractor.classify("invoice.pdf", classifications)
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")
Expected Output
Document classified as: Invoice
Confidence level: 10
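A classification result is most useful when it drives a decision. The sketch below routes a document by name and sends low-confidence results to manual review. It assumes the confidence is an integer on a 1 to 10 scale, which the sample output suggests but does not confirm. The threshold of 8 is arbitrary, and `ClassifiedDoc` and `route` are hypothetical stand-ins rather than ExtractThinker types.

```python
from dataclasses import dataclass

@dataclass
class ClassifiedDoc:
    """Stand-in for the classification result (name plus confidence)."""
    name: str
    confidence: int

def route(result: ClassifiedDoc, min_confidence: int = 8) -> str:
    """Return a queue name, falling back to manual review when unsure."""
    if result.confidence < min_confidence:
        return "manual_review"
    return result.name.lower().replace(" ", "_")

print(route(ClassifiedDoc("Invoice", 10)))
print(route(ClassifiedDoc("Driver License", 5)))
```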
Conclusion
ExtractThinker offers a robust and flexible framework for document processing, enabling seamless integration with LLMs for data extraction and classification. By defining structured contracts, utilizing OCR tools, and integrating intelligent classification, businesses can automate document intelligence workflows efficiently.
Resources
- ExtractThinker GitHub Repository
- Pydantic Documentation
- OpenAI API Documentation
- ExtractThinker Build Fast with AI Notebook