SentenceTransformers: Semantic Similarity and Clustering



Introduction

SentenceTransformers provides a simple yet powerful framework for generating embeddings, which are numerical representations of sentences. These embeddings are widely used in text classification, clustering, search, and retrieval tasks. By the end of this blog, you’ll understand how to:

  • Generate embeddings using pre-trained models.
  • Apply embeddings for text summarization, clustering, and semantic similarity.
  • Perform multilingual sentence analysis.
  • Utilize clustering techniques to group sentences based on their meaning.

Getting Started

Importing Necessary Libraries

Before diving in, ensure you have all the required libraries installed. Use the following commands to install the dependencies:

!pip install sentence-transformers lexrank nltk scikit-learn

Once installed, import the essential packages for NLP and data manipulation:

import nltk
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

You’ll also need to download the necessary NLTK data for tokenization:

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this table for sent_tokenize

Text Summarization with LexRank

Text summarization condenses long pieces of text into shorter summaries while preserving the key points. Here, we’ll use LexRank, a graph-based unsupervised algorithm, to summarize a document.

Implementation

First, we import LexRank and define the input text:

from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS

# Example document
document = """
New York City (NYC), often called simply New York, is the most populous city in the United States. The city is known for its cultural diversity and iconic landmarks like the Statue of Liberty and Central Park.
"""

# Tokenize sentences
sentences = nltk.sent_tokenize(document)

# Initialize LexRank; it expects a corpus of documents (each a list of
# sentences), so treat each sentence here as its own single-sentence document
lexrank = LexRank([[s] for s in sentences], stopwords=STOPWORDS['en'])

# Generate summary
summary = lexrank.get_summary(sentences, summary_size=1)
print("\n".join(summary))

Output

New York City (NYC), often called simply New York, is the most populous city in the United States.

Explanation

LexRank calculates the importance of each sentence in the text using a graph-based ranking algorithm. By identifying the most representative sentences, it generates concise summaries.
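
If you want to inspect those rankings directly, the same LexRank object exposes a rank_sentences method that returns a centrality score per sentence (a minimal sketch; exact scores depend on the corpus the object was built from):

# One centrality score per input sentence; higher means more representative
scores = lexrank.rank_sentences(sentences, threshold=None)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")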

Real-World Application

Text summarization is widely used in journalism, legal documents, and research papers to provide quick overviews of lengthy content.

Generating Sentence Embeddings

Embeddings are numerical representations of text that capture its semantic meaning. SentenceTransformers makes it easy to generate high-quality embeddings for sentences and documents.

Loading a Pre-Trained Model

SentenceTransformers provides several pre-trained models. For this example, we’ll use the all-MiniLM-L6-v2 model:

model = SentenceTransformer("all-MiniLM-L6-v2")

Encoding Sentences

To convert sentences into embeddings, use the encode method:

sentences = [
    "The weather is lovely today.",
    "It’s so sunny outside!",
    "He drove to the market.",
    "The market was busy with shoppers."
]

embeddings = model.encode(sentences)
print(embeddings.shape)

Output

(4, 384)

Each sentence is represented as a 384-dimensional vector. These vectors capture semantic information, enabling various NLP tasks.

Detailed Explanation

  • SentenceTransformer: Provides the pre-trained model.
  • encode: Converts each sentence into an embedding vector.
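
As a quick sanity check that the vectors really do capture meaning, you can compare them pairwise with the cosine_similarity function imported earlier (a minimal sketch reusing the sentences and embeddings above):

# Pairwise cosine similarity between all four embeddings; the two weather
# sentences and the two market sentences should score highest together
similarity_matrix = cosine_similarity(embeddings)
print(np.round(similarity_matrix, 2))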

Real-World Application

Sentence embeddings are essential for building recommendation systems, semantic search engines, and chatbot applications.

Semantic Similarity

Semantic similarity measures how closely two pieces of text relate in meaning. This is particularly useful for tasks like duplicate detection and paraphrase identification.

Implementation

sentence1 = "I love reading books."
sentence2 = "I enjoy reading novels."

# Generate embeddings
embedding1 = model.encode([sentence1])
embedding2 = model.encode([sentence2])

# Calculate similarity
similarity = cosine_similarity(embedding1, embedding2)
print("Cosine Similarity:", similarity[0][0])

Output

Cosine Similarity: 0.85

Explanation

  • Cosine Similarity: Measures the cosine of the angle between two vectors, representing their similarity. Higher values indicate closer meanings.
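
Under the hood, cosine similarity is simply the dot product of the two vectors divided by the product of their norms. Here is a minimal NumPy equivalent of the sklearn call above:

# Manual cosine similarity, equivalent to sklearn's cosine_similarity
a, b = embedding1[0], embedding2[0]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine Similarity:", manual)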

Real-World Application

Semantic similarity is widely used in plagiarism detection, text deduplication, and FAQ matching systems.

Clustering Sentences

Clustering groups sentences with similar meanings, enabling tasks like topic modeling and document organization.

Implementation

sentences = [
    "I love ice cream.",
    "Ice cream is delicious.",
    "I enjoy swimming.",
    "Swimming is a great exercise."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Apply KMeans clustering (fixed random_state for reproducible runs)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(embeddings)

# Assign clusters
clusters = kmeans.labels_
for i, cluster in enumerate(clusters):
    print(f"Sentence: {sentences[i]} - Cluster: {cluster}")

Output

Sentence: I love ice cream. - Cluster: 0
Sentence: Ice cream is delicious. - Cluster: 0
Sentence: I enjoy swimming. - Cluster: 1
Sentence: Swimming is a great exercise. - Cluster: 1

Explanation

  • KMeans: A clustering algorithm that partitions data points into a fixed number of groups based on distance.
  • Cluster Labels: The cluster assigned to each sentence. The label numbers themselves are arbitrary; what matters is which sentences end up grouped together.
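
One practical note: KMeans measures Euclidean distance, while sentence embeddings are usually compared by cosine similarity. L2-normalizing the vectors first makes the two roughly agree; a small sketch reusing the embeddings above:

# L2-normalize each embedding so Euclidean distance between vectors
# tracks cosine similarity, then cluster as before
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
kmeans_cosine = KMeans(n_clusters=2, random_state=42, n_init=10)
print(kmeans_cosine.fit_predict(normalized))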

Real-World Application

Clustering is invaluable for organizing customer feedback, grouping similar documents, and performing market research analysis.

Multilingual Sentence Embeddings

SentenceTransformers supports multilingual embeddings, enabling applications across different languages.

Implementation

# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Define multilingual sentences
sentences = ["I love programming.", "Me encanta programar.", "J'aime programmer."]

# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)

Output

(3, 384)

Explanation

  • Multilingual models map sentences from different languages into a shared 384-dimensional vector space, so semantically equivalent sentences land close together regardless of language. This enables cross-lingual applications.
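
To verify that the three translations really land near each other, compare the embeddings pairwise (a quick sketch using the multilingual model and sentences loaded above):

# All three sentences say "I love programming", so pairwise cosine
# similarities should be high despite the different languages
print(np.round(cosine_similarity(embeddings), 2))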

Real-World Application

Multilingual embeddings are used for machine translation, cross-lingual search, and global content analysis.

Conclusion

SentenceTransformers is a versatile library that lets developers handle a wide range of NLP tasks with ease, from semantic similarity and clustering to multilingual analysis. Whether you're building chatbots, search engines, or recommendation systems, it provides the embedding tools to succeed.
