SentenceTransformers: Semantic Similarity and Clustering
SentenceTransformers is a Python library for generating sentence embeddings: dense vector representations of text used for semantic similarity, clustering, and summarization. Built on transformer models such as BERT, it captures sentence meaning efficiently, powering applications like search engines, topic clustering, and text summarization.

Introduction
SentenceTransformers provides a simple yet powerful framework for generating embeddings, which are numerical representations of sentences. These embeddings are widely used in text classification, clustering, search, and retrieval tasks. By the end of this blog, you’ll understand how to:
- Generate embeddings using pre-trained models.
- Apply embeddings for text summarization, clustering, and semantic similarity.
- Perform multilingual sentence analysis.
- Utilize clustering techniques to group sentences based on their meaning.
Getting Started
Importing Necessary Libraries
Before diving in, ensure you have all the required libraries installed. Use the following commands to install the dependencies:
!pip install sentence-transformers lexrank nltk
Once installed, import the essential packages for NLP and data manipulation:
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
You’ll also need to download the necessary NLTK data for tokenization:
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases (3.9+)
Text Summarization with LexRank
Text summarization condenses long pieces of text into shorter summaries while preserving the key points. Here, we’ll use LexRank, a graph-based unsupervised algorithm, to summarize a document.
Implementation
First, we import LexRank and define the input text:
from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS

# Example document
document = """
New York City (NYC), often called simply New York, is the most populous city in the United States.
The city is known for its cultural diversity and iconic landmarks like the Statue of Liberty and Central Park.
"""

# Tokenize sentences
sentences = nltk.sent_tokenize(document)

# Initialize LexRank
lexrank = LexRank(sentences, stopwords=STOPWORDS['en'])

# Generate summary
summary = lexrank.get_summary(sentences, summary_size=1)
print("\n".join(summary))
Output
New York City (NYC), often called simply New York, is the most populous city in the United States.
Explanation
LexRank calculates the importance of each sentence in the text using a graph-based ranking algorithm. By identifying the most representative sentences, it generates concise summaries.
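If you want to see why a particular sentence was selected, the lexrank library also exposes per-sentence centrality scores through its rank_sentences method. A minimal sketch, reusing the lexrank object and sentences from above:

# Inspect each sentence's LexRank centrality score
# (higher scores mean more representative sentences)
scores = lexrank.rank_sentences(sentences, threshold=None)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")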
Real-World Application
Text summarization is widely used in journalism, legal documents, and research papers to provide quick overviews of lengthy content.
Generating Sentence Embeddings
Embeddings are numerical representations of text that capture its semantic meaning. SentenceTransformers makes it easy to generate high-quality embeddings for sentences and documents.
Loading a Pre-Trained Model
SentenceTransformers provides several pre-trained models. For this example, we’ll use all-MiniLM-L6-v2, a compact model that balances speed and embedding quality:
model = SentenceTransformer("all-MiniLM-L6-v2")
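By default the model runs on CPU. If you have a GPU available, you can pass it explicitly through the constructor's device argument; a small optional sketch:

import torch

# Pick a GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)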
Encoding Sentences
To convert sentences into embeddings, use the encode method:
sentences = [ "The weather is lovely today.", "It’s so sunny outside!", "He drove to the market.", "The market was busy with shoppers." ] embeddings = model.encode(sentences) print(embeddings.shape)
Output
(4, 384)
Each sentence is represented as a 384-dimensional vector. These vectors capture semantic information, enabling various NLP tasks.
Detailed Explanation
- SentenceTransformer: Provides the pre-trained model.
- encode: Converts each sentence into an embedding vector; it also takes a few useful optional arguments, shown in the sketch below.
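For instance, normalize_embeddings returns unit-length vectors, so a plain dot product equals cosine similarity. A short sketch using these documented encode parameters:

embeddings = model.encode(
    sentences,
    batch_size=32,              # sentences embedded per forward pass
    show_progress_bar=False,    # set True when encoding long lists
    normalize_embeddings=True,  # unit-length vectors: dot product == cosine similarity
)
print(embeddings.shape)  # still (4, 384)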
Real-World Application
Sentence embeddings are essential for building recommendation systems, semantic search engines, and chatbot applications.
Semantic Similarity
Semantic similarity measures how closely two pieces of text relate in meaning. This is particularly useful for tasks like duplicate detection and paraphrase identification.
Implementation
sentence1 = "I love reading books." sentence2 = "I enjoy reading novels." # Generate embeddings embedding1 = model.encode([sentence1]) embedding2 = model.encode([sentence2]) # Calculate similarity similarity = cosine_similarity(embedding1, embedding2) print("Cosine Similarity:", similarity[0][0])
Output
Cosine Similarity: 0.85
Explanation
- Cosine Similarity: Measures the cosine of the angle between two vectors, representing their similarity. Higher values indicate closer meanings; the sketch below computes it by hand.
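To make the formula concrete, here is the same computation done manually with NumPy: the dot product of the two vectors divided by the product of their norms. It should match the scikit-learn result above:

# Manual cosine similarity: dot(a, b) / (||a|| * ||b||)
a, b = embedding1[0], embedding2[0]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Manual cosine similarity:", manual)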
Real-World Application
Semantic similarity is widely used in plagiarism detection, text deduplication, and FAQ matching systems.
Clustering Sentences
Clustering groups sentences with similar meanings, enabling tasks like topic modeling and document organization.
Implementation
sentences = [ "I love ice cream.", "Ice cream is delicious.", "I enjoy swimming.", "Swimming is a great exercise." ] # Generate embeddings embeddings = model.encode(sentences) # Apply KMeans clustering kmeans = KMeans(n_clusters=2) kmeans.fit(embeddings) # Assign clusters clusters = kmeans.labels_ for i, cluster in enumerate(clusters): print(f"Sentence: {sentences[i]} - Cluster: {cluster}")
Output
Sentence: I love ice cream. - Cluster: 0
Sentence: Ice cream is delicious. - Cluster: 0
Sentence: I enjoy swimming. - Cluster: 1
Sentence: Swimming is a great exercise. - Cluster: 1
Explanation
- KMeans: A clustering algorithm that groups similar data points.
- Cluster Labels: Indicate the cluster assigned to each sentence; a heuristic for choosing the number of clusters follows below.
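In practice you rarely know the right number of clusters up front. One common heuristic, not part of the original example, is scikit-learn's silhouette score: try a few values of k and keep the one that scores highest:

from sklearn.metrics import silhouette_score

# Compare cluster counts by silhouette score (higher is better)
for k in range(2, 4):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(embeddings)
    print(f"k={k}: silhouette={silhouette_score(embeddings, labels):.3f}")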
Real-World Application
Clustering is invaluable for organizing customer feedback, grouping similar documents, and performing market research analysis.
Multilingual Sentence Embeddings
SentenceTransformers supports multilingual embeddings, enabling applications across different languages.
Implementation
# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Define multilingual sentences
sentences = ["I love programming.", "Me encanta programar.", "J'aime programmer."]

# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
Output
(3, 384)
Explanation
- Multilingual embeddings capture semantic similarity across different languages, enabling cross-lingual applications; the sketch below compares the three sentences directly.
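Because all three sentences land in one shared vector space, you can compare them directly across languages. A quick check using the library's util.cos_sim helper:

from sentence_transformers import util

# Pairwise similarity across the English, Spanish, and French versions
# of the same sentence; scores should be high despite the language gap
similarity_matrix = util.cos_sim(embeddings, embeddings)
print(similarity_matrix)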
Real-World Application
Multilingual embeddings are used for machine translation, cross-lingual search, and global content analysis.
Conclusion
SentenceTransformers is a versatile library that empowers developers to handle various NLP tasks with ease. By leveraging its capabilities, you can perform tasks like semantic similarity, clustering, and multilingual analysis efficiently. Whether you're building chatbots, search engines, or recommendation systems, SentenceTransformers offers the tools to succeed.
Resources
- SentenceTransformers Documentation
- LexRank GitHub
- Build Fast with AI's Sentence Transformers Notebook