Unstructured: The Best Tool for Text Preprocessing

Introduction
The rise of Large Language Models (LLMs) has created a need for efficient text preprocessing tools that can handle diverse document formats. Unstructured is an open-source library that extracts, cleans, and structures text from many file types, which makes it a good fit for LLM applications. In this post, we explore its capabilities, walk through practical code examples, and show how it integrates with LangChain and ChromaDB for text processing and vector database ingestion.
Why Use Unstructured?
Key Features:
- Multi-format Support: Works with PDFs, Word documents, HTML, and more. 📄
- Text Extraction: Extracts text while maintaining document structure. 📝
- Data Cleaning: Prepares text for better LLM performance. 🧹
- Element Chunking: Splits text into meaningful segments. 🧩
- Seamless Integration: Works with LangChain and other LLM tools. 🤝
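To make the "Data Cleaning" feature more concrete, here is a minimal plain-Python sketch of the kind of normalization involved. The function name normalize_text is a hypothetical stand-in written for this post, not part of Unstructured; the library ships its own cleaners in unstructured.cleaners.core (for example clean_extra_whitespace and replace_unicode_quotes), which you would normally use instead.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Hypothetical helper: normalize Unicode, straighten quotes, collapse whitespace."""
    # NFKC folds compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Replace curly quotes with their ASCII equivalents.
    for curly, plain in (("\u201c", '"'), ("\u201d", '"'), ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, plain)
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  \u201cHello\u201d,\u00a0  world!\n\n"))  # "Hello", world!
```

Cleaning like this matters because stray whitespace and inconsistent quote characters inflate token counts and can hurt retrieval quality downstream.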
Installation
To begin using Unstructured, install the required dependencies:
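    pip install "unstructured[pdf]" langchain_community chromadb tiktoken
This installs Unstructured with its PDF extras, along with the LangChain community integrations, ChromaDB for vector storage, and tiktoken for token counting. The quotes keep shells such as zsh from interpreting the square brackets. The OpenAI-backed examples later in this post also rely on the openai package being installed.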
Setting Up API Keys
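If you're using OpenAI models in Google Colab, read your API key from Colab's secrets store and expose it as an environment variable:
    import os
    from google.colab import userdata

    # Copy the secret stored in Colab into the environment for the OpenAI client
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")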
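Outside Colab, google.colab is not available, so a small fallback can help. The sketch below is one possible approach, not something the library requires: resolve_api_key is a hypothetical helper that reads the key from the environment and prompts for it only when it is missing.

```python
import os
from getpass import getpass


def resolve_api_key(name: str = "OPENAI_API_KEY", env=os.environ, prompt=getpass) -> str:
    """Return the key from the environment, prompting once if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} is required")
        env[name] = key  # make it visible to libraries that read the environment
    return key


# Typical usage in a script or local notebook:
# resolve_api_key()
```

The env and prompt parameters exist only so the helper can be exercised without touching the real environment or blocking on input.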
Extracting Text from a PDF
Extracting text from PDFs is a common need for research papers, reports, and scanned documents. Unstructured makes this process seamless.
Code:
    from unstructured.partition.auto import partition
    import requests

    pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF URL

    # Download the PDF and save it locally
    response = requests.get(pdf_url)
    response.raise_for_status()  # fail early if the download did not succeed
    with open("example.pdf", "wb") as f:
        f.write(response.content)

    # Partition the PDF
    elements = partition(filename="example.pdf")

    # Print extracted text
    for element in elements:
        print(element.text)
Explanation:
- Downloads a PDF from a URL.
- Uses partition to extract text while preserving structure.
- Iterates over extracted elements and prints the text.
Expected Output:
The text of each extracted element (titles, paragraphs, list items), printed element by element in document order.
Real-World Application:
Use this method for processing research papers, business reports, and scanned contracts for LLM-based summarization or analysis.
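Each element carries more than its text: it also has a category such as "Title" or "NarrativeText". As a rough illustration of how that structure can be used, the sketch below counts elements per category and collects the titles. summarize_elements is a hypothetical helper, and the sample objects are made-up stand-ins so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements per category and collect the text of every title."""
    counts = Counter(el.category for el in elements)
    titles = [el.text for el in elements if el.category == "Title"]
    return counts, titles


# Stand-in objects with made-up text; real elements come from partition().
sample = [
    SimpleNamespace(category="Title", text="Sample Title"),
    SimpleNamespace(category="NarrativeText", text="First paragraph of body text."),
    SimpleNamespace(category="NarrativeText", text="Second paragraph of body text."),
]
counts, titles = summarize_elements(sample)
print(dict(counts))
print(titles)
```

With real data you would pass the list returned by partition(filename="example.pdf") instead of the sample list.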
Extracting Text from a Local .txt File
For plain text files, Unstructured provides an efficient way to partition and process text.
Code:
    from unstructured.partition.text import partition_text

    # Create a sample text file
    with open("dummy_text.txt", "w") as f:
        f.write("This is a sample text file.\n")
        f.write("It contains multiple lines of text.\n")
        f.write("Unstructured can process this easily.")

    # Extract text
    elements = partition_text(filename="dummy_text.txt")
    for element in elements:
        print(element.text)
Expected Output:
    This is a sample text file.
    It contains multiple lines of text.
    Unstructured can process this easily.
Application:
This method is useful for preprocessing logs, articles, or any text file before feeding it into an LLM.
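Before text reaches an LLM it usually has to fit a size budget. The following is a minimal sketch of greedy chunking over element texts, written for illustration only; chunk_texts and its max_chars parameter are hypothetical names. Unstructured provides its own structure-aware chunking (for example chunk_by_title in unstructured.chunking.title), which is the better choice in practice.

```python
def chunk_texts(texts, max_chars=500):
    """Greedily pack consecutive texts into chunks of up to max_chars characters.

    A single text longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) <= max_chars or not current:
            current = candidate  # fits, or it is the first text of a new chunk
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


lines = [
    "This is a sample text file.",
    "It contains multiple lines of text.",
    "Unstructured can process this easily.",
]
for chunk in chunk_texts(lines, max_chars=70):
    print(repr(chunk))
```

With max_chars=70 the first two lines share a chunk and the third starts a new one, because adding it would exceed the budget.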
Extracting Text from a Website
Extracting content from web pages can be crucial for news aggregation, data collection, or competitive analysis.
Code:
    from unstructured.partition.html import partition_html
    import requests

    url = "https://www.unstructured.io/"
    response = requests.get(url)
    html_content = response.text

    # Partition HTML
    elements = partition_html(text=html_content)
    for element in elements:
        print(element.text)
Expected Output:
Extracted text from the web page, including article content and structured elements.
Use Case:
Use this approach to scrape articles, blog posts, or documentation for LLM-powered summarization or analysis.
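Web pages tend to yield short navigation and footer fragments alongside the article body. One way to reduce that noise, sketched here under the assumption that filtering by element category and length is good enough for your pages, is a small filter. keep_body_text, BODY_CATEGORIES, and the min_chars threshold are hypothetical choices, and the sample objects are made-up stand-ins.

```python
from types import SimpleNamespace

# Categories treated as page body in this sketch (an assumption, adjust as needed).
BODY_CATEGORIES = {"Title", "NarrativeText", "ListItem"}


def keep_body_text(elements, min_chars=20):
    """Keep text from body-like elements, skipping short non-title fragments."""
    kept = []
    for el in elements:
        text = (el.text or "").strip()
        if not text or el.category not in BODY_CATEGORIES:
            continue
        if el.category != "Title" and len(text) < min_chars:
            continue  # likely a menu entry or button label
        kept.append(text)
    return kept


# Stand-in objects; real elements come from partition_html().
page = [
    SimpleNamespace(category="Title", text="Product Overview"),
    SimpleNamespace(category="ListItem", text="Login"),
    SimpleNamespace(category="NarrativeText", text="This paragraph is long enough to count as body text."),
    SimpleNamespace(category="UncategorizedText", text="Cookie settings and other footer material"),
]
print(keep_body_text(page))
```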
Vector Database Ingestion with ChromaDB
Unstructured also helps in creating vector-based document retrieval systems. Here’s how to use it with ChromaDB and LangChain.
Gathering Links from CNN Lite
    from unstructured.partition.html import partition_html

    cnn_lite_url = "https://lite.cnn.com/"
    elements = partition_html(url=cnn_lite_url)

    links = []
    for element in elements:
        if element.metadata.link_urls:
            # Drop the leading "/" so the path can be appended to the base URL
            relative_link = element.metadata.link_urls[0][1:]
            # Article paths start with the publication year
            if relative_link.startswith("2025"):
                links.append(f"{cnn_lite_url}{relative_link}")
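The loop above builds URLs by string slicing and only looks at the first link of each element. A slightly more defensive variant, sketched below, resolves hrefs with urllib.parse.urljoin and drops duplicates. article_links is a hypothetical helper, and the example paths are invented for illustration.

```python
from urllib.parse import urljoin


def article_links(base_url, hrefs, year_prefix="2025"):
    """Resolve relative hrefs against base_url, keep dated article paths, drop duplicates."""
    seen, links = set(), []
    for href in hrefs:
        if not href.lstrip("/").startswith(year_prefix):
            continue  # not a dated article path
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Invented example paths, including one duplicate and one non-article link.
hrefs = ["/2025/01/15/world/example-story", "/about", "/2025/01/15/world/example-story"]
print(article_links("https://lite.cnn.com/", hrefs))
```

To use it with real data, collect every entry from element.metadata.link_urls across all elements into one list and pass that list as hrefs.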
Ingesting Articles
    from langchain_community.document_loaders import UnstructuredURLLoader

    # Fetch each article URL and partition it with Unstructured
    loader = UnstructuredURLLoader(urls=links, show_progress_bar=True)
    docs = loader.load()
Storing Documents in ChromaDB
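    from langchain_community.vectorstores import Chroma
    from langchain_community.embeddings import OpenAIEmbeddings

    # Embed the documents and index them in an in-memory Chroma collection
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(docs, embeddings)

    # Retrieve the single most similar document for the query
    query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)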
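To build intuition for what similarity_search does, here is a brute-force sketch of top-k retrieval by cosine similarity over toy vectors. This illustrates the idea only; it is not how Chroma is implemented, and Chroma's distance metric and index are configurable. The function names and the three-dimensional vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query vector."""
    scored = sorted(doc_vecs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy embeddings; real ones come from an embedding model and have many more dimensions.
doc_vecs = {"doc_a": [1.0, 0.0, 0.0], "doc_b": [0.6, 0.8, 0.0], "doc_c": [0.0, 0.0, 1.0]}
print(top_k([0.5, 0.9, 0.0], doc_vecs, k=1))  # ['doc_b']
```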
Summarizing Retrieved Documents
    from langchain_community.chat_models import ChatOpenAI
    from langchain.chains.summarize import load_summarize_chain

    llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

    # "stuff" places all retrieved documents into a single prompt
    chain = load_summarize_chain(llm, chain_type="stuff")
    summary = chain.run(query_docs)
    print(summary)
Expected Output:
A concise summary of the most relevant article matching the query.
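The "stuff" chain type simply places every retrieved document into one prompt. The sketch below shows that idea in plain Python; build_stuff_prompt and its instruction text are hypothetical and do not reproduce LangChain's actual prompt template. The stand-in objects mimic the page_content attribute of LangChain documents.

```python
from types import SimpleNamespace


def build_stuff_prompt(docs, instruction="Write a concise summary of the following text:"):
    """Concatenate every document into a single prompt, as the 'stuff' strategy does."""
    body = "\n\n".join(doc.page_content.strip() for doc in docs)
    return f"{instruction}\n\n{body}\n\nSummary:"


# Stand-in documents with made-up content.
docs = [
    SimpleNamespace(page_content="First article text. "),
    SimpleNamespace(page_content="Second article text."),
]
prompt = build_stuff_prompt(docs)
print(prompt)
```

Because everything goes into one request, this strategy only works while the combined text fits in the model's context window; that is why the example above retrieves a single document with k=1.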
Real-World Application:
- Use case: Automating news summarization.
- Benefit: Reduces manual effort in tracking trending topics.
Conclusion
Unstructured is a powerful tool for preprocessing text from diverse sources, making it an invaluable asset for LLM applications. Whether extracting text from PDFs, processing web content, or integrating with vector databases, Unstructured streamlines workflows for AI-powered applications.
Next Steps
- Try Unstructured with your own dataset.
- Explore LangChain and ChromaDB for more advanced NLP applications.
- Check out Unstructured’s official documentation for further customization.
Resources and Community
Join our community of 12,000+ AI enthusiasts and learn to build powerful AI applications! Whether you're a beginner or an experienced developer, our resources will help you understand and implement Generative AI in your projects.
- Website: www.buildfastwithai.com
- LinkedIn: linkedin.com/company/build-fast-with-ai/
- Instagram: instagram.com/buildfastwithai/
- Twitter: x.com/satvikps
- Telegram: t.me/BuildFastWithAI