Unstructured: The Best Tool for Text Preprocessing

Introduction

The rise of Large Language Models (LLMs) has created a need for efficient text preprocessing tools that can handle diverse document formats. Unstructured is an open-source library designed to extract, clean, and structure text from various file types, making it well suited to LLM applications. In this post, we will explore its capabilities, demonstrate its usage with practical code examples, and show how it integrates with LangChain and ChromaDB for document loading, vector database ingestion, and retrieval-based summarization.

Why Use Unstructured?

Key Features:

Multi-format Support: Works with PDFs, Word documents, HTML, and more. 📄
Text Extraction: Extracts text while maintaining document structure. 📝
Data Cleaning: Prepares text for better LLM performance. 🧹
Element Chunking: Splits text into meaningful segments. 🧩
Seamless Integration: Works with LangChain and other LLM tools. 🤝

Installation

To begin using Unstructured, install the required dependencies:

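pip install "unstructured[pdf]" langchain langchain-community openai chromadb tiktoken

This installs Unstructured with its PDF extras, the LangChain packages, ChromaDB for vector storage, the OpenAI client used by the embedding and chat models later in this post, and tiktoken for token counting. The quotes around unstructured[pdf] keep shells such as zsh from interpreting the square brackets.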
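
Before moving on, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is made up for this post, and the module names in `required` are the import names assumed for the packages installed above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed for the packages installed above.
required = ["unstructured", "langchain_community", "chromadb", "tiktoken"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```

If anything is reported as missing, re-run the install command in the same interpreter or notebook kernel that will run the rest of the examples.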

Setting Up API Keys

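If you're using OpenAI models, make your API key available through the OPENAI_API_KEY environment variable. The snippet below assumes Google Colab, where the key has been saved as a notebook secret: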

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
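
The `google.colab` module is only available inside Colab. Outside it, one option is to read the key from the environment and fall back to an interactive prompt. The sketch below assumes that approach; `resolve_api_key` is a hypothetical helper name, and the `env` and `prompt` parameters exist only so the function can be exercised without touching the real environment.

```python
import os
from getpass import getpass


def resolve_api_key(name="OPENAI_API_KEY", env=None, prompt=getpass):
    """Return the key from the environment, prompting only when it is absent."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        key = prompt(f"Enter {name}: ")
        env[name] = key  # make it visible to libraries that read the environment
    return key
```

Calling `resolve_api_key()` once at the top of a script or notebook leaves the key in `os.environ`, which is where the OpenAI-backed LangChain classes look for it by default.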

Extracting Text from a PDF

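Extracting text from PDFs is a common need for research papers, reports, and scanned documents. Unstructured handles the common case with a single partition call. Scanned, image-only PDFs additionally require OCR, which depends on extra system dependencies such as Tesseract.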

Code:

import requests
from unstructured.partition.auto import partition

pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF URL

# Download the PDF and fail early on HTTP errors
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()

with open("example.pdf", "wb") as f:
    f.write(response.content)

# Partition the PDF into a list of elements (titles, paragraphs, etc.)
elements = partition(filename="example.pdf")

# Print the text of each element
for element in elements:
    print(element.text)

Explanation:

Downloads a PDF from a URL and saves it locally.
Uses partition, which detects the file type and returns a list of typed elements such as titles and narrative text.
Iterates over the extracted elements and prints the text of each.
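
Each element returned by partition carries a `category` attribute (for example `Title` or `NarrativeText`) alongside its text, so a quick tally gives a feel for how a document was segmented. The sketch below uses stand-in objects so it runs without a PDF; `category_counts` is a hypothetical helper, and the sample texts are invented.

```python
from collections import Counter
from types import SimpleNamespace


def category_counts(elements):
    """Count elements by their `category` attribute."""
    return Counter(el.category for el in elements)


# Stand-in objects for illustration; real elements come from partition().
sample = [
    SimpleNamespace(category="Title", text="A Sample Title"),
    SimpleNamespace(category="NarrativeText", text="First paragraph of the body."),
    SimpleNamespace(category="NarrativeText", text="Second paragraph of the body."),
]

for category, count in category_counts(sample).most_common():
    print(f"{category}: {count}")
# NarrativeText: 2
# Title: 1
```

Passing the `elements` list from the PDF example to `category_counts` should work the same way, since only the `category` attribute is read.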

Expected Output:

The text of each element in document order: titles, paragraphs, list items, and so on. Visual formatting such as fonts and page layout is not reproduced in the printed text.

Real-World Application:

Use this method for processing research papers, business reports, and scanned contracts for LLM-based summarization or analysis.

Extracting Text from a Local `.txt` File

For plain text files, Unstructured provides an efficient way to partition and process text.

Code:

from unstructured.partition.text import partition_text

# Create a sample text file
with open("dummy_text.txt", "w") as f:
    f.write("This is a sample text file.\n")
    f.write("It contains multiple lines of text.\n")
    f.write("Unstructured can process this easily.")

# Extract text
elements = partition_text(filename="dummy_text.txt")

for element in elements:
    print(element.text)

Expected Output:

This is a sample text file.
It contains multiple lines of text.
Unstructured can process this easily.

Application:

This method is useful for preprocessing logs, articles, or any text file before feeding it into an LLM.
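
Before text reaches an LLM it usually has to be grouped into chunks that fit a size budget. Unstructured ships its own chunking functions (for example `chunk_by_title` in `unstructured.chunking.title`); the sketch below is not that implementation, only a hand-rolled illustration of the idea. The function name `pack_into_chunks` and the character budget are assumptions made for this example.

```python
def pack_into_chunks(texts, max_chars=200):
    """Greedily join consecutive texts into chunks of at most max_chars characters.

    A single text longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty pieces
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


lines = [
    "This is a sample text file.",
    "It contains multiple lines of text.",
    "Unstructured can process this easily.",
]
for chunk in pack_into_chunks(lines, max_chars=70):
    print(repr(chunk))
```

With a 70-character budget the first two lines fit together and the third starts a new chunk. In practice the input would be `[el.text for el in elements]`.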

Extracting Text from a Website

Extracting content from web pages can be crucial for news aggregation, data collection, or competitive analysis.

Code:

import requests
from unstructured.partition.html import partition_html

url = "https://www.unstructured.io/"

# Fetch the page and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
html_content = response.text

# Partition the HTML string into elements
elements = partition_html(text=html_content)

for element in elements:
    print(element.text)

Expected Output:

The text of each element found on the page, one per line. Depending on the site, this may include navigation labels and footer text alongside the main content.

Use Case:

Use this approach to scrape articles, blog posts, or documentation for LLM-powered summarization or analysis.
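
Web pages tend to mix article text with menus and other short fragments, so it is common to keep only the element categories that carry the main content. The following is a sketch under that assumption; `main_content` and the chosen category set are illustrative, and the stand-in objects replace real elements so the example runs offline.

```python
from types import SimpleNamespace

# Categories assumed to carry the main content; adjust for the page at hand.
CONTENT_CATEGORIES = {"Title", "NarrativeText", "ListItem"}


def main_content(elements, keep=CONTENT_CATEGORIES):
    """Return the stripped text of non-empty elements whose category is in `keep`."""
    return [
        el.text.strip()
        for el in elements
        if el.category in keep and el.text.strip()
    ]


# Stand-in objects for illustration; real elements come from partition_html().
page = [
    SimpleNamespace(category="UncategorizedText", text="Menu"),
    SimpleNamespace(category="Title", text="An Example Headline"),
    SimpleNamespace(category="NarrativeText", text="  The body of the article.  "),
    SimpleNamespace(category="NarrativeText", text="   "),
]
print(main_content(page))
# ['An Example Headline', 'The body of the article.']
```

Which categories count as content is a judgment call per site, so it is worth printing `element.category` for a sample page before settling on a filter.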

Vector Database Ingestion with ChromaDB

Unstructured also helps in creating vector-based document retrieval systems. Here’s how to use it with ChromaDB and LangChain.

Gathering Links from CNN Lite

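from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
    # link_urls holds the hrefs found in this element, if any
    if element.metadata.link_urls:
        # Drop the leading "/" so the path can be appended to the base URL
        relative_link = element.metadata.link_urls[0][1:]
        # Article paths start with the publication year
        if relative_link.startswith("2025"):
            links.append(f"{cnn_lite_url}{relative_link}")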
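
The link-building step can also be written with `urllib.parse.urljoin`, which handles leading slashes, and with de-duplication so the same article is not loaded twice. This is a sketch of one possible variant, not the code used above: `article_urls` is a hypothetical helper, the example paths are invented, and the filter is slightly stricter in that it requires the year to be followed by a slash.

```python
from urllib.parse import urljoin


def article_urls(base_url, hrefs, year="2025"):
    """Resolve site-relative hrefs against base_url, keeping each dated article path once."""
    seen, urls = set(), []
    for href in hrefs:
        # Keep only paths that begin with the year, e.g. /2025/...
        if not href.lstrip("/").startswith(f"{year}/"):
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Invented hrefs for illustration.
hrefs = [
    "/2025/01/02/world/example-story",
    "/about",
    "/2025/01/02/world/example-story",
]
print(article_urls("https://lite.cnn.com/", hrefs))
# ['https://lite.cnn.com/2025/01/02/world/example-story']
```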

Ingesting Articles

from langchain_community.document_loaders import UnstructuredURLLoader

# Fetch each article URL and partition it into LangChain documents
loader = UnstructuredURLLoader(urls=links, show_progress_bar=True)
docs = loader.load()

Storing Documents in ChromaDB

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Embed the documents and index them in a Chroma collection
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

# Retrieve the single most similar article for the query
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)
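
Conceptually, a similarity search embeds the query and returns the stored documents whose vectors lie closest to it. The toy example below illustrates that idea with cosine similarity over made-up three-dimensional vectors. It is not how Chroma is implemented, and Chroma's index and default distance metric may differ; the function names are invented for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings"; real ones have hundreds or thousands of dimensions.
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
print(top_k([0.2, 0.9, 0.0], doc_vecs, k=1))
# [1]
```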

Summarizing Retrieved Documents

from langchain_community.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
# "stuff" places all retrieved documents into a single prompt
chain = load_summarize_chain(llm, chain_type="stuff")
summary = chain.run(query_docs)
print(summary)

Expected Output:

A concise summary of the most relevant article matching the query.
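
The "stuff" chain type works by concatenating the content of every retrieved document into one prompt. The sketch below shows that idea in isolation. The template wording is hypothetical and the chain's built-in prompt may differ; `build_stuff_prompt` is a made-up helper that only relies on each document having a `page_content` attribute, as LangChain documents do.

```python
from types import SimpleNamespace

# Hypothetical template for illustration; the chain's built-in prompt may be worded differently.
TEMPLATE = "Write a concise summary of the following:\n\n{text}\n\nSummary:"


def build_stuff_prompt(docs, template=TEMPLATE, separator="\n\n"):
    """Concatenate every document's page_content into a single prompt string."""
    text = separator.join(doc.page_content for doc in docs)
    return template.format(text=text)


# Stand-in documents for illustration.
docs_sample = [
    SimpleNamespace(page_content="First article text."),
    SimpleNamespace(page_content="Second article text."),
]
print(build_stuff_prompt(docs_sample))
```

Because everything goes into one prompt, this approach only works when the retrieved documents fit within the model's context window, which is one reason the example retrieves a single document with k=1.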

Real-World Application:

Use case: Automating news summarization.
Benefit: Reduces manual effort in tracking trending topics.

Conclusion

Unstructured is a powerful tool for preprocessing text from diverse sources, making it an invaluable asset for LLM applications. Whether extracting text from PDFs, processing web content, or integrating with vector databases, Unstructured streamlines workflows for AI-powered applications.

Next Steps

Try Unstructured with your own dataset.
Explore LangChain and ChromaDB for more advanced NLP applications.
Check out Unstructured’s official documentation for further customization.

Resources and Community

Join our community of 12,000+ AI enthusiasts and learn to build powerful AI applications! Whether you're a beginner or an experienced developer, our resources will help you understand and implement Generative AI in your projects.

Website: www.buildfastwithai.com
LinkedIn: linkedin.com/company/build-fast-with-ai/
Instagram: instagram.com/buildfastwithai/
Twitter: x.com/satvikps
Telegram: t.me/BuildFastWithAI
