Extract Structured Data from Unstructured Text Using LangExtract + Gemini
Learn how to extract structured data like names, dates, or medications from messy unstructured text using LangExtract and Gemini. Perfect for devs & AI workflows.

Getting Started with LangExtract: Unlocking Text Analysis with AI
LangExtract is a powerful Python library that simplifies extracting structured data from unstructured text, leveraging advanced AI models like Google’s Gemini and OpenAI’s GPT-4o. Whether you’re analyzing literature, news articles, or medical records, LangExtract enables you to identify specific entities—such as names, products, emotions, or medications—while preserving their context through attributes. Its interactive visualization interface, accessible via Google Colab, makes it an ideal tool for developers, researchers, and enthusiasts looking to streamline text analysis.
This guide, based on the provided LangExtract notebook, walks you through setting up and using the library to extract meaningful insights from text. We’ll cover installation, defining extraction tasks, running extractions, visualizing results, and using both Gemini and OpenAI models, ensuring you can apply LangExtract to diverse use cases.
Why LangExtract?
LangExtract stands out for its ability to transform raw text into structured data with minimal effort. Key features include:
Flexible Extraction: Extract entities like names, AI models, products, or medical details with customizable attributes.
AI Model Support: Compatible with powerful models like Gemini-2.5-Pro and GPT-4o for accurate results.
Interactive Visualization: Displays extracted entities in an interactive HTML interface, highlighting their positions and attributes.
Ease of Use: Runs seamlessly in Google Colab, requiring no local setup.
Whether you’re a beginner or an experienced developer, LangExtract’s clear prompts and example-driven approach make it accessible and powerful.
Prerequisites
Before starting, ensure you have:
Google Colab Account: For running the notebook in a cloud-based environment.
API Keys: Obtain a Gemini API key (from Google AI Studio) and/or an OpenAI API key for GPT-4o (from the OpenAI platform).
Basic Python Knowledge: Familiarity with Python and Jupyter notebooks helps, but the process is beginner-friendly.
Step 1: Installation
To use LangExtract, install the library in Google Colab:
!pip -q install langextract
Set up your API keys securely using Colab’s userdata feature:
import os
from google.colab import userdata

# Pull the keys from Colab's secrets store rather than hardcoding them
os.environ["LANGEXTRACT_API_KEY"] = userdata.get('GOOGLE_API_KEY')
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
Add your API keys in Colab’s secrets panel (🔑 icon) to keep them secure and avoid hardcoding.
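If you are running outside Colab (a local Jupyter session or a plain script), the secrets panel is not available. A minimal alternative sketch, assuming you supply the same two keys yourself, is to prompt for them at runtime with the standard library:

import os
import getpass

# Prompt for the keys interactively so they never end up in the notebook or shell history
if not os.environ.get("LANGEXTRACT_API_KEY"):
    os.environ["LANGEXTRACT_API_KEY"] = getpass.getpass("Gemini API key: ")
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")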
Step 2: Defining Extraction Tasks
LangExtract relies on a clear prompt and high-quality examples to guide the AI model.
import textwrap
import langextract as lx

# Prompt describing what to extract and how to format it
prompt = textwrap.dedent("""
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.
    """)

# One high-quality example showing the model the expected output
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    ),
]
Step 3: Running Extractions
With the prompt and examples defined, a single call to lx.extract runs the model over your input text:
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction with Gemini
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)
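To sanity-check the output before visualizing it, you can loop over the returned extractions. This is a minimal sketch that assumes the result object exposes an extractions list whose items carry the same extraction_class, extraction_text, and attributes fields used when defining the examples:

# Print each extraction with its class, the exact source text, and its attributes
for extraction in result.extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text!r} -> {extraction.attributes}")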
Step 4: Visualizing Results
Save the annotated results to a JSONL file, then let LangExtract generate an interactive HTML view. A small block of custom CSS keeps the highlight colors readable on a white background:
from IPython.display import HTML

# Save the annotated document and build the interactive visualization
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_object = lx.visualize("extraction_results.jsonl")

# Optional CSS overrides so each entity class gets a distinct highlight color
custom_css = """
<style>
  body { background-color: white !important; color: black !important; }
  .entity.character { background-color: #DDEEFF !important; color: black !important; }
  .entity.emotion { background-color: #C0F5C0 !important; color: black !important; }
  .entity.relationship { background-color: #FFEDB3 !important; color: black !important; }
</style>
"""

# Prepend the CSS, save a standalone HTML file, and render it inline
html = custom_css + html_object.data
with open("visualization.html", "w") as f:
    f.write(html)

HTML(html)
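Colab's runtime storage is temporary, so you may want to pull the generated report down to your machine. Assuming you are running in Colab, the built-in files helper does this in one line:

from google.colab import files

# Download the standalone HTML report to your local machine
files.download("visualization.html")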
Step 5: Using OpenAI Models
LangExtract is not tied to Gemini. The same workflow runs against OpenAI's GPT-4o by pointing lx.extract at the OpenAI language model class. This example extracts medications together with their dosage, frequency, and condition, and uses a medication_group attribute to keep related details linked:
input_text = """
The patient was prescribed Lisinopril and Metformin last month.
He takes the Lisinopril 10mg daily for hypertension, but often misses
his Metformin 500mg dose which should be taken twice daily for diabetes.
"""

prompt_description = """
Extract medications with their details, using attributes to group related information:
1. Extract entities in the order they appear in the text
2. Each entity must have a 'medication_group' attribute linking it to its medication
3. All details about a medication should share the same medication_group value
"""

# One worked example showing how dosage, frequency, and condition are grouped per medication
examples = [
    lx.data.ExampleData(
        text="Patient takes Aspirin 100mg daily for heart health and Simvastatin 20mg at bedtime.",
        extractions=[
            lx.data.Extraction(extraction_class="medication", extraction_text="Aspirin", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="100mg", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="daily", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="condition", extraction_text="heart health", attributes={"medication_group": "Aspirin"}),
            lx.data.Extraction(extraction_class="medication", extraction_text="Simvastatin", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="dosage", extraction_text="20mg", attributes={"medication_group": "Simvastatin"}),
            lx.data.Extraction(extraction_class="frequency", extraction_text="at bedtime", attributes={"medication_group": "Simvastatin"}),
        ],
    ),
]

# Run the extraction with GPT-4o instead of Gemini
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt_description,
    examples=examples,
    language_model_type=lx.inference.OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,  # expect the model's JSON wrapped in a fenced code block
    use_schema_constraints=False,  # schema constraints are not applied for OpenAI models
)
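Once the call returns, the medication_group attribute makes it straightforward to reassemble one record per medication. A minimal sketch, again assuming the result exposes an extractions list with the fields shown above:

from collections import defaultdict

# Group every extracted detail under the medication it belongs to
grouped = defaultdict(list)
for extraction in result.extractions:
    group = (extraction.attributes or {}).get("medication_group", "ungrouped")
    grouped[group].append((extraction.extraction_class, extraction.extraction_text))

for medication, details in grouped.items():
    print(medication)
    for detail_class, detail_text in details:
        print(f"  {detail_class}: {detail_text}")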
Step 6: Visualizing the Medication Extractions
Visualization works exactly the same way for the OpenAI run. Save the new results under a separate filename so the Romeo-and-Juliet output is not overwritten, and adjust the CSS selectors to the medication-related entity classes:
from IPython.display import HTML

# Save the medication extractions and build the interactive visualization
lx.io.save_annotated_documents([result], output_name="medication_results.jsonl", output_dir=".")
html_object = lx.visualize("medication_results.jsonl")

# Optional CSS overrides for the medication-related entity classes
custom_css = """
<style>
  body { background-color: white !important; color: black !important; }
  .entity.medication { background-color: #DDEEFF !important; color: black !important; }
  .entity.dosage { background-color: #C0F5C0 !important; color: black !important; }
  .entity.frequency { background-color: #FFEDB3 !important; color: black !important; }
  .entity.condition { background-color: #FFD6E0 !important; color: black !important; }
</style>
"""

# Prepend the CSS, save a standalone HTML file, and render it inline
html = custom_css + html_object.data
with open("medication_visualization.html", "w") as f:
    f.write(html)

HTML(html)
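For downstream analysis it is often handy to flatten the extractions into a table. A minimal sketch using pandas (preinstalled in Colab), under the same assumption about the result object's fields:

import pandas as pd

# One row per extracted entity, with its attributes spread into columns
rows = [
    {
        "class": extraction.extraction_class,
        "text": extraction.extraction_text,
        **(extraction.attributes or {}),
    }
    for extraction in result.extractions
]
df = pd.DataFrame(rows)
print(df)

# Persist as CSV for spreadsheets or BI tools
df.to_csv("medication_extractions.csv", index=False)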
Practical Applications
Literature Analysis: Pull characters, emotions, and relationships from plays or novels, as in the Romeo and Juliet example above.
News Entity Extraction: Identify people, organizations, and locations mentioned in articles (see the sketch below).
Medical Record Processing: Capture medications, dosages, frequencies, and conditions, as in the clinical note example above.
Business Intelligence: Turn reports, reviews, and support tickets into structured records ready for analysis.
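As an illustration of the news use case, the same prompt-plus-examples pattern carries over directly. The entity classes, sample sentences, and attributes below are made up for demonstration; only the LangExtract calls mirror the earlier steps:

# Prompt and single example for a hypothetical news-entity task
news_prompt = "Extract people, organizations, and locations using the exact text from the article."

news_examples = [
    lx.data.ExampleData(
        text="Sundar Pichai announced new AI features at Google I/O in Mountain View.",
        extractions=[
            lx.data.Extraction(extraction_class="person", extraction_text="Sundar Pichai", attributes={"role": "CEO"}),
            lx.data.Extraction(extraction_class="organization", extraction_text="Google", attributes={"industry": "technology"}),
            lx.data.Extraction(extraction_class="location", extraction_text="Mountain View", attributes={"type": "city"}),
        ],
    ),
]

news_result = lx.extract(
    text_or_documents="Tim Cook unveiled the latest iPhone at Apple Park in Cupertino.",
    prompt_description=news_prompt,
    examples=news_examples,
    model_id="gemini-2.5-pro",
)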
Troubleshooting
API Errors: Confirm that LANGEXTRACT_API_KEY and/or OPENAI_API_KEY are set and valid (a quick check is sketched below).
Bad Results: Tighten the prompt wording and add more, and more varied, examples.
Visualization Not Working: Recheck the JSONL path passed to lx.visualize and the HTML write logic.
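A quick environment check, using the variable names from the setup step, catches the most common cause of authentication failures before any extraction call is made:

import os

# Fail fast if either key is missing from the environment
for key in ("LANGEXTRACT_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; add it in Colab's secrets panel and re-run the setup cell.")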
References
LangExtract GitHub: https://github.com/google/langextract
LangExtract Build Fast With AI Colab notebook: https://git.new/LangExtract
Conclusion
LangExtract empowers users to extract structured insights from text with ease, leveraging AI models to handle complex analysis tasks. By defining clear prompts, providing examples, and using interactive visualizations, you can unlock the potential of text data in literature, news, medical records, and more.
Whether you’re a beginner exploring AI or a developer building advanced applications, LangExtract in Google Colab offers a seamless way to get started. Dive in, experiment with your own texts, and discover the power of structured text analysis!