Extract Structured Data from Unstructured Text Using LangExtract + Gemini

Getting Started with LangExtract: Unlocking Text Analysis with AI

LangExtract is a powerful Python library that simplifies extracting structured data from unstructured text, leveraging advanced AI models like Google’s Gemini and OpenAI’s GPT-4o. Whether you’re analyzing literature, news articles, or medical records, LangExtract enables you to identify specific entities—such as names, products, emotions, or medications—while preserving their context through attributes. Its interactive visualization interface, accessible via Google Colab, makes it an ideal tool for developers, researchers, and enthusiasts looking to streamline text analysis.

This guide, based on the LangExtract Colab notebook linked in the References, walks you through setting up and using the library to extract meaningful insights from text. We’ll cover installation, defining extraction tasks, running extractions, visualizing results, and using both Gemini and OpenAI models, so you can apply LangExtract to your own use cases.

Why LangExtract?

LangExtract stands out for its ability to transform raw text into structured data with minimal effort. Key features include:

Flexible Extraction: Extract entities like names, AI models, products, or medical details with customizable attributes.
AI Model Support: Works with models such as gemini-2.5-pro and gpt-4o, the two model IDs used in this guide.
Interactive Visualization: Displays extracted entities in an interactive HTML interface, highlighting their positions and attributes.
Ease of Use: Runs seamlessly in Google Colab, requiring no local setup.

Whether you’re a beginner or an experienced developer, LangExtract’s clear prompts and example-driven approach make it accessible and powerful.

Prerequisites

Before starting, ensure you have:

Google Colab Account: For running the notebook in a cloud-based environment.
API Keys: Obtain a Gemini API key (from Google AI Studio) and/or an OpenAI API key (from the OpenAI platform dashboard).
Basic Python Knowledge: Familiarity with Python and Jupyter notebooks helps, but the process is beginner-friendly.

Step 1: Installation

To use LangExtract, install the library in Google Colab:

!pip -q install langextract

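Set up your API keys securely using Colab’s userdata feature:

import os
from google.colab import userdata

# Read the keys from Colab's secrets panel instead of hardcoding them.
os.environ["LANGEXTRACT_API_KEY"] = userdata.get('GOOGLE_API_KEY')
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

Store your keys in Colab’s secrets panel (🔑 icon) under the names GOOGLE_API_KEY and OPENAI_API_KEY, and grant the notebook access to them. This keeps the keys out of the notebook itself.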
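A missing or placeholder key usually surfaces only as an API error at extraction time, so it can help to check for it up front. The following is a minimal sketch, not part of LangExtract: `require_api_key` is a hypothetical helper, and the placeholder list is an assumption based on the secret names used above.

```python
import os

# Hypothetical helper, not part of LangExtract: fail fast when a key is
# missing or is still one of the placeholder names rather than a real key.
PLACEHOLDERS = {"", "GOOGLE_API_KEY", "OPENAI_API_KEY"}


def require_api_key(env_name: str) -> str:
    """Return the key stored in the given environment variable, or raise."""
    value = os.environ.get(env_name, "").strip()
    if value in PLACEHOLDERS:
        raise RuntimeError(
            f"{env_name} is missing or still a placeholder; "
            "add the real key to Colab's secrets panel first."
        )
    return value
```

Calling `require_api_key("LANGEXTRACT_API_KEY")` before the first extraction turns a confusing API failure into an immediate, readable error.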

Step 2: Defining Extraction Tasks

LangExtract relies on a clear prompt and high-quality examples to guide the AI model.

import textwrap
import langextract as lx

prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ]
    )
]
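Because the prompt asks for exact text, in order of appearance, with no overlaps, the examples themselves should follow those rules. Here is a minimal sketch of a sanity check, written against plain tuples so it runs without LangExtract; `check_example` is a hypothetical helper, not a library function.

```python
from typing import List, Tuple


def check_example(text: str, extractions: List[Tuple[str, str]]) -> List[str]:
    """Return a list of problems found in (extraction_class, extraction_text) pairs.

    A span passes if it occurs verbatim in `text` at or after the end of the
    previous span, which enforces both ordering and non-overlap.
    """
    problems = []
    cursor = 0  # index just past the previously matched span
    for cls, span in extractions:
        pos = text.find(span, cursor)
        if pos != -1:
            cursor = pos + len(span)
        elif span in text:
            problems.append(f"{cls}: {span!r} is out of order or overlaps a previous span")
        else:
            problems.append(f"{cls}: {span!r} does not appear verbatim in the text")
    return problems


# The same text and spans as the Romeo example above.
romeo_text = (
    "ROMEO. But soft! What light through yonder window breaks? "
    "It is the east, and Juliet is the sun."
)
romeo_spans = [
    ("character", "ROMEO"),
    ("emotion", "But soft!"),
    ("relationship", "Juliet is the sun"),
]
print(check_example(romeo_text, romeo_spans))  # an empty list means no problems
```

Running a check like this before each extraction catches paraphrased or misordered example spans, which are a common cause of poor results.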

Step 3: Running Extractions

input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo."

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)
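Before visualizing, it is often useful to look at what came back. The sketch below assumes the returned object exposes an `extractions` list whose items carry the same `extraction_class`, `extraction_text`, and `attributes` fields used in the examples. `summarize` is a hypothetical helper, and the stand-in data is hand-written for illustration, not real model output.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(result) -> Counter:
    """Print each extraction and return a count per extraction class."""
    counts = Counter()
    for ext in result.extractions:
        attrs = ext.attributes or {}  # attributes may be missing
        counts[ext.extraction_class] += 1
        print(f"{ext.extraction_class:<14}{ext.extraction_text!r}  {attrs}")
    return counts


# Stand-in object so the sketch runs without an API call.
# The values are illustrative only.
demo_result = SimpleNamespace(extractions=[
    SimpleNamespace(extraction_class="character", extraction_text="Lady Juliet",
                    attributes={"emotional_state": "longing"}),
    SimpleNamespace(extraction_class="emotion", extraction_text="her heart aching",
                    attributes=None),
])
demo_counts = summarize(demo_result)
```

In the notebook, passing the real `result` to `summarize` in place of `demo_result` should give a quick per-class overview.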

Step 4: Visualizing Results

lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_object = lx.visualize("extraction_results.jsonl")

custom_css = """
<style>
body { background-color: white !important; color: black !important; }
.entity.character { background-color: #DDEEFF !important; color: black !important; }
.entity.emotion { background-color: #C0F5C0 !important; color: black !important; }
.entity.relationship { background-color: #FFEDB3 !important; color: black !important; }
</style>
"""

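# lx.visualize returns an HTML object in notebooks and a plain string elsewhere.
html_body = html_object.data if hasattr(html_object, "data") else html_object
html = custom_css + html_body
with open("visualization.html", "w", encoding="utf-8") as f:
    f.write(html)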

from IPython.display import HTML
HTML(html)
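The saved .jsonl file holds one JSON document per line, so it can also be inspected with the standard library. This is a minimal sketch: `load_jsonl` is a hypothetical helper, and the field names in the sample line are an assumption about the saved format rather than a documented schema.

```python
import json
import os
import tempfile


def load_jsonl(path: str) -> list:
    """Parse one JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hand-written sample line so the sketch runs on its own.
# Field names here are assumed, not taken from a documented schema.
sample = {
    "text": "Lady Juliet gazed longingly at the stars",
    "extractions": [
        {"extraction_class": "character", "extraction_text": "Lady Juliet"}
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    docs = load_jsonl(sample_path)

print(len(docs), "document(s) loaded")
```

Pointing `load_jsonl` at extraction_results.jsonl and printing the keys of the first document is a quick way to confirm what was actually written.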

Step 5: Using OpenAI Models

input_text = """
The patient was prescribed Lisinopril and Metformin last month.
He takes the Lisinopril 10mg daily for hypertension, but often misses
his Metformin 500mg dose which should be taken twice daily for diabetes.
"""

prompt_description = """
Extract medications with their details, using attributes to group related information:
1. Extract entities in the order they appear in the text
2. Each entity must have a 'medication_group' attribute linking it to its medication
3. All details about a medication should share the same medication_group value
"""

examples = [
    lx.data.ExampleData(
        text="Patient takes Aspirin 100mg daily for heart health and Simvastatin 20mg at bedtime.",
        extractions=[
            lx.data.Extraction(extraction_class="medication", extraction_text="Aspirin", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="100mg", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="daily", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="condition", extraction_text="heart health", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="medication", extraction_text="Simvastatin", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="20mg", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="at bedtime", attributes={"medication_group": "Simvastatin"})
        ]
    )
]

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    language_model_type=lx.inference.OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    # The two settings below are the ones the notebook uses with OpenAI models.
    fence_output=True,
    use_schema_constraints=False
)
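The point of the `medication_group` attribute is that related details can be regrouped afterwards. Here is a minimal sketch of that grouping, assuming each extraction exposes `extraction_class`, `extraction_text`, and `attributes` as in the examples. `group_by_medication` is a hypothetical helper, and the stand-in list is hand-written from the input text, not real model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_medication(extractions) -> dict:
    """Map each medication_group value to its (class, text) pairs, in order."""
    groups = defaultdict(list)
    for ext in extractions:
        attrs = ext.attributes or {}
        group = attrs.get("medication_group")
        if group is None:
            continue  # skip extractions the model left ungrouped
        groups[group].append((ext.extraction_class, ext.extraction_text))
    return dict(groups)


def _ext(cls: str, text: str, group: str) -> SimpleNamespace:
    """Build a stand-in extraction object for the demo."""
    return SimpleNamespace(extraction_class=cls, extraction_text=text,
                           attributes={"medication_group": group})


# Illustrative stand-ins so the sketch runs without an API call.
demo_extractions = [
    _ext("medication", "Lisinopril", "Lisinopril"),
    _ext("dosage", "10mg", "Lisinopril"),
    _ext("condition", "hypertension", "Lisinopril"),
    _ext("medication", "Metformin", "Metformin"),
    _ext("dosage", "500mg", "Metformin"),
    _ext("frequency", "twice daily", "Metformin"),
]
grouped = group_by_medication(demo_extractions)
for name, details in grouped.items():
    print(name, details)
```

In the notebook, `group_by_medication(result.extractions)` would apply the same grouping to the model's output.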

Step 6: Visualizing Results

lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_object = lx.visualize("extraction_results.jsonl")

custom_css = """
<style>
body { background-color: white !important; color: black !important; }
.entity.medication { background-color: #DDEEFF !important; color: black !important; }
.entity.dosage { background-color: #C0F5C0 !important; color: black !important; }
.entity.frequency { background-color: #FFEDB3 !important; color: black !important; }
</style>
"""

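# lx.visualize returns an HTML object in notebooks and a plain string elsewhere.
html_body = html_object.data if hasattr(html_object, "data") else html_object
html = custom_css + html_body
with open("visualization.html", "w", encoding="utf-8") as f:
    f.write(html)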

from IPython.display import HTML
HTML(html)
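The style block is keyed by extraction class, so its selectors need to match the classes each task defines (for the medication task: medication, dosage, frequency, condition). A small generator avoids hand-editing the CSS per task. This is a sketch: `build_entity_css` is a hypothetical helper that reuses the `.entity.<class>` selector pattern from the snippet above, and whether the generated HTML honors those selectors is an assumption carried over from the notebook.

```python
def build_entity_css(colors: dict) -> str:
    """Build a <style> block with one .entity.<class> rule per extraction class."""
    rules = ["body { background-color: white !important; color: black !important; }"]
    for cls, color in colors.items():
        rules.append(
            f".entity.{cls} {{ background-color: {color} !important; "
            "color: black !important; }"
        )
    return "<style>\n" + "\n".join(rules) + "\n</style>\n"


# Colors for the medication task; the fourth color is an arbitrary choice.
medication_css = build_entity_css({
    "medication": "#DDEEFF",
    "dosage": "#C0F5C0",
    "frequency": "#FFEDB3",
    "condition": "#F5C0E0",
})
print(medication_css)
```

The returned string can be used wherever `custom_css` is used above.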

Practical Applications

Literature Analysis
News Entity Extraction
Medical Record Processing
Business Intelligence

Troubleshooting

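API Errors: Confirm that LANGEXTRACT_API_KEY (for Gemini) or OPENAI_API_KEY (for OpenAI) is set in the environment and holds a real key rather than a placeholder string.
Bad Results: Tighten the prompt or add more examples, and make sure each example's extraction_text is copied verbatim from its text.
Visualization Not Working: Check that extraction_results.jsonl exists in the directory passed as output_dir and that visualization.html was written before you open it.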

References

LangExtract GitHub: https://github.com/google/langextract
LangExtract Build Fast With AI Colab Notebook: https://git.new/LangExtract

Conclusion

LangExtract empowers users to extract structured insights from text with ease, leveraging AI models to handle complex analysis tasks. By defining clear prompts, providing examples, and using interactive visualizations, you can unlock the potential of text data in literature, news, medical records, and more.

Whether you’re a beginner exploring AI or a developer building advanced applications, LangExtract in Google Colab offers a seamless way to get started. Dive in, experiment with your own texts, and discover the power of structured text analysis!
