SentenceTransformers: Semantic Similarity and Clustering

Introduction
SentenceTransformers provides a simple yet powerful framework for generating embeddings, which are numerical representations of sentences. These embeddings are widely used in text classification, clustering, search, and retrieval tasks. By the end of this blog, you’ll understand how to:
- Generate embeddings using pre-trained models.
- Apply embeddings for text summarization, clustering, and semantic similarity.
- Perform multilingual sentence analysis.
- Utilize clustering techniques to group sentences based on their meaning.
Getting Started
Importing Necessary Libraries
Before diving in, ensure you have all the required libraries installed. Use the following command to install the dependencies:
!pip install sentence-transformers lexrank nltk scikit-learn
Once installed, import the essential packages for NLP and data manipulation:
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
You’ll also need to download the necessary NLTK data for tokenization:
nltk.download('punkt')  # newer NLTK releases may also require: nltk.download('punkt_tab')
Text Summarization with LexRank
Text summarization condenses long pieces of text into shorter summaries while preserving the key points. Here, we’ll use LexRank, a graph-based unsupervised algorithm, to summarize a document.
Implementation
First, we import LexRank and define the input text:
from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS

# Example document
document = """
New York City (NYC), often called simply New York, is the most populous city in the United States.
The city is known for its cultural diversity and iconic landmarks like the Statue of Liberty and Central Park.
"""

# Tokenize the document into sentences
sentences = nltk.sent_tokenize(document)

# Initialize LexRank (for stronger IDF statistics, initialize on a larger background corpus)
lexrank = LexRank(sentences, stopwords=STOPWORDS['en'])

# Generate a one-sentence summary
summary = lexrank.get_summary(sentences, summary_size=1)
print("\n".join(summary))
Output
New York City (NYC), often called simply New York, is the most populous city in the United States.
Explanation
LexRank calculates the importance of each sentence in the text using a graph-based ranking algorithm. By identifying the most representative sentences, it generates concise summaries.
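To make the graph intuition concrete, here is a minimal LexRank-style scorer written from scratch (a sketch of the idea, not the lexrank library's internals); the TF-IDF features, damping factor, and iteration count are illustrative choices:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(sentences, damping=0.85, iterations=50):
    # Build a sentence-similarity graph from TF-IDF vectors
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    np.fill_diagonal(similarity, 0.0)  # ignore self-similarity
    # Normalize each row into transition probabilities
    row_sums = similarity.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    transition = similarity / row_sums
    # PageRank-style power iteration over the sentence graph
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

scores = lexrank_scores(sentences)
print(sentences[int(np.argmax(scores))])  # the most central sentence
The highest-scoring sentence is the most central one in the graph, which is what get_summary returns when summary_size=1.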
Real-World Application
Text summarization is widely used in journalism, legal documents, and research papers to provide quick overviews of lengthy content.
Generating Sentence Embeddings
Embeddings are numerical representations of text that capture its semantic meaning. SentenceTransformers makes it easy to generate high-quality embeddings for sentences and documents.
Loading a Pre-Trained Model
SentenceTransformers provides several pre-trained models. For this example, we’ll use the “all-MiniLM-L6-v2” model:
model = SentenceTransformer("all-MiniLM-L6-v2")
Encoding Sentences
To convert sentences into embeddings, use the encode method:
sentences = [
    "The weather is lovely today.",
    "It’s so sunny outside!",
    "He drove to the market.",
    "The market was busy with shoppers."
]

# Encode the sentences into embedding vectors
embeddings = model.encode(sentences)
print(embeddings.shape)
Output
(4, 384)
Each sentence is represented as a 384-dimensional vector. These vectors capture semantic information, enabling various NLP tasks.
Detailed Explanation
- SentenceTransformer: Provides the pre-trained model.
- encode: Converts each sentence into an embedding vector (it also accepts optional arguments, sketched below).
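Beyond the defaults, encode accepts keyword arguments that help at scale. The ones below exist in recent sentence-transformers releases, but check the documentation for your installed version:
embeddings = model.encode(
    sentences,
    batch_size=32,              # process sentences in batches
    convert_to_numpy=True,      # return a numpy array (the default)
    normalize_embeddings=True,  # unit-length vectors, so dot product equals cosine similarity
)
print(embeddings.shape)  # still (4, 384)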
Real-World Application
Sentence embeddings are essential for building recommendation systems, semantic search engines, and chatbot applications.
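As a small taste of semantic search, the library’s util.semantic_search helper ranks a corpus against a query; the toy corpus and query below are invented for illustration:
from sentence_transformers import util

# Toy corpus and query (illustrative)
corpus = ["The weather is lovely today.", "He drove to the market.", "She is studying machine learning."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("How is the weather?", convert_to_tensor=True)

# Retrieve the top-2 most similar corpus sentences
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], round(hit['score'], 2))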
Semantic Similarity
Semantic similarity measures how closely two pieces of text relate in meaning. This is particularly useful for tasks like duplicate detection and paraphrase identification.
Implementation
sentence1 = "I love reading books."
sentence2 = "I enjoy reading novels."

# Generate embeddings
embedding1 = model.encode([sentence1])
embedding2 = model.encode([sentence2])

# Calculate similarity
similarity = cosine_similarity(embedding1, embedding2)
print("Cosine Similarity:", similarity[0][0])
Output
Cosine Similarity: 0.85
Explanation
- Cosine Similarity: Measures the cosine of the angle between two embedding vectors; higher values indicate closer meanings (verified by hand below).
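The same number can be computed by hand from the definition, cos(θ) = (a · b) / (‖a‖ ‖b‖), reusing embedding1 and embedding2 from the snippet above:
import numpy as np

# cosine(a, b) = dot(a, b) / (||a|| * ||b||)
a, b = embedding1[0], embedding2[0]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Manual cosine similarity:", manual)  # should match sklearn's value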
Real-World Application
Semantic similarity is widely used in plagiarism detection, text deduplication, and FAQ matching systems.
Clustering Sentences
Clustering groups sentences with similar meanings, enabling tasks like topic modeling and document organization.
Implementation
sentences = [
    "I love ice cream.",
    "Ice cream is delicious.",
    "I enjoy swimming.",
    "Swimming is a great exercise."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(embeddings)

# Assign clusters
clusters = kmeans.labels_
for i, cluster in enumerate(clusters):
    print(f"Sentence: {sentences[i]} - Cluster: {cluster}")
Output
Sentence: I love ice cream. - Cluster: 0
Sentence: Ice cream is delicious. - Cluster: 0
Sentence: I enjoy swimming. - Cluster: 1
Sentence: Swimming is a great exercise. - Cluster: 1
Explanation
- KMeans: A clustering algorithm that groups similar data points.
- Cluster Labels: Indicate the cluster assigned to each sentence (a follow-up sketch below picks a representative sentence per cluster).
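Reusing kmeans, embeddings, clusters, and sentences from the snippet above, a sketch for picking each cluster’s representative sentence (the one nearest its centroid) looks like this:
import numpy as np

# For each cluster, find the sentence nearest its centroid
for cluster_id, center in enumerate(kmeans.cluster_centers_):
    distances = np.linalg.norm(embeddings - center, axis=1)
    distances[clusters != cluster_id] = np.inf  # only consider this cluster's members
    print(f"Cluster {cluster_id}: {sentences[int(np.argmin(distances))]}")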
Real-World Application
Clustering is invaluable for organizing customer feedback, grouping similar documents, and performing market research analysis.
Multilingual Sentence Embeddings
SentenceTransformers supports multilingual embeddings, enabling applications across different languages.
Implementation
# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Define multilingual sentences
sentences = ["I love programming.", "Me encanta programar.", "J'aime programmer."]

# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
Output
(3, 384)
Explanation
- Multilingual embeddings capture semantic similarity across different languages, enabling cross-lingual applications; the quick check below compares translations directly.
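As a quick sanity check of that claim, the embeddings from the snippet above can be compared directly; all three sentences mean the same thing, so the cross-lingual similarity scores should be high:
from sklearn.metrics.pairwise import cosine_similarity

# Compare the English sentence to its Spanish and French translations
scores = cosine_similarity([embeddings[0]], embeddings[1:])
print("EN vs ES:", round(float(scores[0][0]), 2))
print("EN vs FR:", round(float(scores[0][1]), 2))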
Real-World Application
Multilingual embeddings are used for machine translation, cross-lingual search, and global content analysis.
Conclusion
SentenceTransformers is a versatile library that empowers developers to handle various NLP tasks with ease. By leveraging its capabilities, you can perform tasks like semantic similarity, clustering, and multilingual analysis efficiently. Whether you're building chatbots, search engines, or recommendation systems, SentenceTransformers offers the tools to succeed.
Resources
- SentenceTransformers Documentation
- LexRank GitHub
- Sentence Transformers Detailed Notebook (Build Fast with AI)