Extract Structured Data from Unstructured Text Using LangExtract + Gemini

August 6, 2025

Getting Started with LangExtract: Unlocking Text Analysis with AI

LangExtract is a Python library for extracting structured data from unstructured text using large language models such as Google’s Gemini and OpenAI’s GPT-4o. Whether you’re analyzing literature, news articles, or medical records, it lets you identify specific entities (names, products, emotions, medications, and so on) and attach attributes that preserve their context. Its interactive HTML visualization, which renders directly in Google Colab, makes it a practical tool for developers, researchers, and enthusiasts who want to streamline text analysis.

This guide, based on the accompanying LangExtract Colab notebook, walks you through setting up and using the library to extract meaningful insights from text. We’ll cover installation, defining extraction tasks, running extractions, visualizing results, and using both Gemini and OpenAI models, so that you can apply LangExtract to your own use cases.

Why LangExtract?

LangExtract stands out for its ability to transform raw text into structured data with minimal effort. Key features include:

  • Flexible Extraction: Extract entities like names, AI models, products, or medical details with customizable attributes.

  • AI Model Support: Works with models such as gemini-2.5-pro and gpt-4o.

  • Interactive Visualization: Displays extracted entities in an interactive HTML interface, highlighting their positions and attributes.

  • Ease of Use: Runs seamlessly in Google Colab, requiring no local setup.

Whether you’re a beginner or an experienced developer, LangExtract’s clear prompts and example-driven approach make it accessible and powerful.

Prerequisites

Before starting, ensure you have:

  • Google Colab Account: For running the notebook in a cloud-based environment.

  • API Keys: A Gemini API key from Google (for example, via Google AI Studio) and/or an OpenAI API key for GPT-4o.

  • Basic Python Knowledge: Familiarity with Python and Jupyter notebooks helps, but the process is beginner-friendly.

Step 1: Installation

To use LangExtract, install the library in Google Colab:

!pip -q install langextract

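Set up your API keys using Colab’s userdata feature, which reads them from the secrets panel instead of the notebook source:

import os
from google.colab import userdata

# Read the keys from Colab secrets named GOOGLE_API_KEY and OPENAI_API_KEY
os.environ["LANGEXTRACT_API_KEY"] = userdata.get("GOOGLE_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

Add both keys in Colab’s secrets panel (🔑 icon) under the names GOOGLE_API_KEY and OPENAI_API_KEY, and grant the notebook access to them, so that the keys are never hardcoded.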
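A missing or empty key usually only surfaces later as an API error, so it can help to fail early. The sketch below is one way to do that with the standard library only; the helper name require_api_key is hypothetical and not part of LangExtract.

```python
import os


def require_api_key(var_name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; add it in Colab's secrets panel "
            "and re-run the setup cell."
        )
    return value


# Demo only: a placeholder value stands in for a real key.
os.environ.setdefault("LANGEXTRACT_API_KEY", "demo-key")
key = require_api_key("LANGEXTRACT_API_KEY")
print(f"LANGEXTRACT_API_KEY is set ({len(key)} characters)")
```

Calling the helper once for each key you plan to use, right after the setup cell, keeps configuration problems separate from extraction problems.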

Step 2: Defining Extraction Tasks

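LangExtract relies on a clear prompt and high-quality examples to guide the AI model. The prompt describes what to extract, and each example pairs a sample text with the extractions you expect from it.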

import textwrap
import langextract as lx

prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ]
    )
]
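Because the prompt asks for exact text in order of appearance, the examples should follow the same rule. The check below is a minimal sketch using plain strings rather than LangExtract objects; check_example is a hypothetical helper, and it simply confirms that each extraction text occurs verbatim in the example text, after the previous one.

```python
def check_example(text, extraction_texts):
    """Return a list of problems: spans missing from the text or out of order."""
    problems = []
    cursor = 0  # the search resumes after the previous match, so order and non-overlap are enforced
    for span in extraction_texts:
        pos = text.find(span, cursor)
        if pos == -1:
            if span in text:
                problems.append(f"out of order or overlapping: {span!r}")
            else:
                problems.append(f"not found verbatim: {span!r}")
        else:
            cursor = pos + len(span)
    return problems


example_text = (
    "ROMEO. But soft! What light through yonder window breaks? "
    "It is the east, and Juliet is the sun."
)
spans = ["ROMEO", "But soft!", "Juliet is the sun"]
print(check_example(example_text, spans))  # → []
```

An empty list means the example is consistent with the prompt's own instructions; any message in the list points at a span to fix before running an extraction.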

Step 3: Running Extractions

input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction with Gemini; the API key is read from LANGEXTRACT_API_KEY
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)
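Before visualizing, it is often useful to look at the extractions as plain rows. The sketch below assumes the returned object exposes an extractions list whose items carry extraction_class, extraction_text, attributes, and a char_interval with start_pos and end_pos; those attribute names are an assumption about the result object, so check them against your installed version. The demo uses a hand-written stand-in, not actual model output.

```python
from types import SimpleNamespace


def extraction_rows(result):
    """Flatten a result's extractions into plain dictionaries."""
    rows = []
    for ext in result.extractions or []:
        interval = getattr(ext, "char_interval", None)
        rows.append({
            "class": ext.extraction_class,
            "text": ext.extraction_text,
            "attributes": dict(ext.attributes or {}),
            # Positions stay None when no character interval is available
            "start": getattr(interval, "start_pos", None),
            "end": getattr(interval, "end_pos", None),
        })
    return rows


# Hand-written stand-in for demonstration; a real run would pass the value returned by lx.extract.
demo_result = SimpleNamespace(extractions=[
    SimpleNamespace(
        extraction_class="character",
        extraction_text="Lady Juliet",
        attributes={"emotional_state": "longing"},
        char_interval=SimpleNamespace(start_pos=0, end_pos=11),
    ),
])

for row in extraction_rows(demo_result):
    print(row["class"], row["text"], row["start"], row["end"])
```

Rows in this shape can be passed to a table library or written to CSV without depending on the library's own classes.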

Step 4: Visualizing Results

# Save the annotated result as JSONL, then build the interactive view from that file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_object = lx.visualize("extraction_results.jsonl")

# Optional style overrides; they only take effect if the generated markup uses these class names
custom_css = """
<style>
body { background-color: white !important; color: black !important; }
.entity.character { background-color: #DDEEFF !important; color: black !important; }
.entity.emotion { background-color: #C0F5C0 !important; color: black !important; }
.entity.relationship { background-color: #FFEDB3 !important; color: black !important; }
</style>
"""

# In Colab, the returned HTML object keeps its markup in .data
html = custom_css + html_object.data
with open("visualization.html", "w", encoding="utf-8") as f:
    f.write(html)

# Render the result inline in the notebook
from IPython.display import HTML
HTML(html)
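The snippet above reads html_object.data, which works when the visualization comes back as a notebook HTML object. Outside a notebook the same call may return a plain string instead, so a small wrapper that accepts either form can be handy. This is a sketch using only the standard library; save_visualization is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def save_visualization(html_content, path, extra_css=""):
    """Write visualization HTML to disk and return the number of characters written."""
    # A notebook HTML object keeps its markup in .data; a plain string is used as-is.
    markup = html_content.data if hasattr(html_content, "data") else str(html_content)
    full_page = extra_css + markup
    Path(path).write_text(full_page, encoding="utf-8")
    return len(full_page)


# Demo with a plain string standing in for the visualization output.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "visualization.html"
    written = save_visualization("<p>demo</p>", target, extra_css="<style></style>")
    print(written, target.exists())
```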

Step 5: Using OpenAI Models

input_text = """
The patient was prescribed Lisinopril and Metformin last month.
He takes the Lisinopril 10mg daily for hypertension, but often misses
his Metformin 500mg dose which should be taken twice daily for diabetes.
"""

prompt_description = """
Extract medications with their details, using attributes to group related information:
1. Extract entities in the order they appear in the text
2. Each entity must have a 'medication_group' attribute linking it to its medication
3. All details about a medication should share the same medication_group value
"""

examples = [
    lx.data.ExampleData(
        text="Patient takes Aspirin 100mg daily for heart health and Simvastatin 20mg at bedtime.",
        extractions=[
            lx.data.Extraction(extraction_class="medication", extraction_text="Aspirin", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="100mg", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="daily", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="condition", extraction_text="heart health", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="medication", extraction_text="Simvastatin", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="20mg", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="at bedtime", attributes={"medication_group": "Simvastatin"})
        ]
    )
]

# Run the same kind of extraction with an OpenAI model
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    language_model_type=lx.inference.OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    # For OpenAI models, request fenced output and turn schema constraints off
    fence_output=True,
    use_schema_constraints=False
)
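The medication_group attribute is what ties a dosage, frequency, or condition back to its medication, so the extractions can be regrouped afterwards with a few lines of Python. The sketch below uses plain dictionaries that mirror the Aspirin and Simvastatin example above rather than real model output; group_by_medication is a hypothetical helper.

```python
from collections import defaultdict


def group_by_medication(extractions):
    """Collect (class, text) pairs under their shared medication_group value."""
    groups = defaultdict(list)
    for ext in extractions:
        group = (ext.get("attributes") or {}).get("medication_group", "ungrouped")
        groups[group].append((ext["extraction_class"], ext["extraction_text"]))
    return dict(groups)


# Plain-dict copies of the example extractions from this step.
sample = [
    {"extraction_class": "medication", "extraction_text": "Aspirin", "attributes": {"medication_group": "Aspirin"}},
    {"extraction_class": "dosage", "extraction_text": "100mg", "attributes": {"medication_group": "Aspirin"}},
    {"extraction_class": "frequency", "extraction_text": "daily", "attributes": {"medication_group": "Aspirin"}},
    {"extraction_class": "medication", "extraction_text": "Simvastatin", "attributes": {"medication_group": "Simvastatin"}},
    {"extraction_class": "dosage", "extraction_text": "20mg", "attributes": {"medication_group": "Simvastatin"}},
]

groups = group_by_medication(sample)
for name, details in groups.items():
    print(name, details)
```

Entities that arrive without a medication_group land under "ungrouped", which makes it easy to spot cases where the model ignored the grouping rule.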

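Step 6: Visualizing the Medication Results

# Save under separate file names so the Step 4 output is not overwritten
lx.io.save_annotated_documents([result], output_name="medication_results.jsonl", output_dir=".")
html_object = lx.visualize("medication_results.jsonl")

# Optional style overrides for the medication classes; they only take effect
# if the generated markup uses these class names
custom_css = """
<style>
body { background-color: white !important; color: black !important; }
.entity.medication { background-color: #DDEEFF !important; color: black !important; }
.entity.dosage { background-color: #C0F5C0 !important; color: black !important; }
.entity.frequency { background-color: #FFEDB3 !important; color: black !important; }
.entity.condition { background-color: #F5D0E6 !important; color: black !important; }
</style>
"""

# In Colab, the returned HTML object keeps its markup in .data
html = custom_css + html_object.data
with open("medication_visualization.html", "w", encoding="utf-8") as f:
    f.write(html)

# Render the result inline in the notebook
from IPython.display import HTML
HTML(html)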
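The saved JSONL file can also be inspected without the library, for instance to count how many entities of each class were found. The sketch below assumes each line is a JSON object with an "extractions" list whose items have an "extraction_class" key; that layout is an assumption about the saved file, so inspect one line of your own output first. The sample line in the demo is hand-written for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_classes(jsonl_path):
    """Count extractions per class across every document in a JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            document = json.loads(line)
            for ext in document.get("extractions") or []:
                counts[ext.get("extraction_class", "unknown")] += 1
    return counts


# Hand-written sample line in the assumed layout (not real output).
sample_doc = {
    "text": "Patient takes Aspirin 100mg daily.",
    "extractions": [
        {"extraction_class": "medication", "extraction_text": "Aspirin"},
        {"extraction_class": "dosage", "extraction_text": "100mg"},
        {"extraction_class": "frequency", "extraction_text": "daily"},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    sample_path.write_text(json.dumps(sample_doc) + "\n", encoding="utf-8")
    class_counts = count_classes(sample_path)

print(dict(class_counts))
```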

Practical Applications

  • Literature Analysis

  • News Entity Extraction

  • Medical Record Processing

  • Business Intelligence

Troubleshooting

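  • API Errors: Confirm that LANGEXTRACT_API_KEY (for Gemini) or OPENAI_API_KEY (for GPT-4o) is set and matches the model you are calling.

  • Bad Results: Tighten the prompt or add more examples that mirror the entities and attributes you expect.

  • Visualization Not Working: Check that the JSONL path passed to lx.visualize matches the output_name and output_dir used when saving, and that the HTML file was actually written.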

References

  • LangExtract GitHub: https://github.com/google/langextract

  • LangExtract Build Fast With AI Colab Notebook: https://git.new/LangExtract

Conclusion

LangExtract empowers users to extract structured insights from text with ease, leveraging AI models to handle complex analysis tasks. By defining clear prompts, providing examples, and using interactive visualizations, you can unlock the potential of text data in literature, news, medical records, and more.

Whether you’re a beginner exploring AI or a developer building advanced applications, LangExtract in Google Colab offers a seamless way to get started. Dive in, experiment with your own texts, and discover the power of structured text analysis!

