Nomic: The AI Tool Every Data Scientist Should Be Using Right Now

Will you let others shape the future for you, or will you lead the way?
Gen AI Launch Pad 2025 is your moment to shine.
Introduction: Empowering Large-Scale Data Insights with Nomic
In the age of big data, the ability to analyze, structure, and visualize large datasets has become crucial. Nomic, an open-source platform, facilitates this process by allowing users to manage diverse datasets (text, images, audio, embeddings, and video) efficiently. Whether you're building a data science project, conducting exploratory data analysis, or performing in-depth data visualization, Nomic can provide the tools needed for these tasks.
In this blog post, we’ll walk through a series of code blocks that showcase how to leverage Nomic for various data processing tasks. By the end of this tutorial, you’ll understand how to set up Nomic, load datasets, generate embeddings, extract topics, and visualize complex data structures, all using Nomic's powerful features.
Detailed Explanation: Code Walkthrough
1. Installing Nomic and Setting Up the Environment
Before we dive into data processing, you need to set up the necessary environment. The following snippet shows how to install Nomic and log in:
!pip install nomic datasets
!nomic login

# In Google Colab, read the API token you stored under the key 'nomic_token'.
from google.colab import userdata
token = userdata.get('nomic_token')

import nomic
nomic.cli.login(token=token)
Explanation:
- pip install nomic datasets installs the nomic and datasets libraries, which are needed for loading and processing datasets.
- nomic login prompts you to log into Nomic with your credentials, enabling access to Nomic's cloud platform for data visualization and mapping.
- The token retrieval step (userdata.get('nomic_token')) is for users working within Google Colab; it reads the stored API token so you are authenticated when accessing your Nomic account.
Expected Output: No direct output; however, successful login ensures access to Nomic's Atlas and other tools.
When to Use: You will use this step at the beginning of your Nomic-based project setup, particularly if you plan to work in a cloud-based notebook environment like Google Colab.
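If you are running outside Colab (for example, in a local script), you can read the token from an environment variable instead of Colab's userdata. The sketch below is a minimal illustration; the variable name NOMIC_TOKEN is our own choice for this example, not something the library requires:

import os
import nomic

# NOMIC_TOKEN is a hypothetical environment variable name chosen for this example;
# export it with the API token generated from your Nomic account.
token = os.environ.get("NOMIC_TOKEN")
if token is None:
    raise RuntimeError("Set the NOMIC_TOKEN environment variable before running this script.")

nomic.cli.login(token=token)  # same login call as in the Colab snippet above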
2. Loading and Selecting a Subset of Data
The following code loads the AG News dataset, a collection of news articles, and selects a random subset of 10,000 documents:
from datasets import load_dataset
import numpy as np

# Load the AG News training split (120,000 labeled news articles).
dataset = load_dataset('ag_news')['train']

# Sample 10,000 indices without replacement so no article is picked twice.
max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]
Explanation:
- load_dataset('ag_news') loads the AG News dataset from Hugging Face, which contains 120,000 training examples of news articles categorized into four topics.
- np.random.choice() selects 10,000 random indices from the dataset (without replacement), allowing us to work with a manageable subset of the data.
- The selected documents are stored in the documents list.
Expected Output: This step won’t produce a visual output but will store the 10,000 random documents in memory for further processing.
When to Use: Use this when you need to work with a subset of a large dataset for faster experimentation, especially when you don't need the entire dataset for training or analysis.
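Before moving on, it helps to check what a sampled record looks like. Each AG News example is a dictionary with a 'text' field (the article) and an integer 'label' (one of the four topic classes), so a quick inspection might look like this:

# Sanity-check the sampled subset.
print(len(documents))               # 10000
print(documents[0].keys())          # dict_keys(['text', 'label'])
print(documents[0]['label'])        # an integer in 0-3
print(documents[0]['text'][:200])   # first 200 characters of the first article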
3. Generating Document Embeddings
Nomic allows you to convert text documents into embeddings, which are vector representations that capture the semantic meaning of the text. Here's how to generate embeddings for the selected subset of documents:
from nomic import embed
import numpy as np

usages = []

def generate_embeddings(documents):
    batch_size = 256
    document_embeddings = []
    batch = []
    for idx, doc in enumerate(documents):
        batch.append(doc['text'])
        # Flush the batch every 256 documents, and once more for the final partial batch.
        if (idx + 1) % batch_size == 0 or idx == len(documents) - 1:
            batch_embeddings = embed.text(texts=batch, model='nomic-embed-text-v1')
            usages.append(batch_embeddings['usage'])
            for item in batch_embeddings['embeddings']:
                document_embeddings.append(item)
            batch = []
    return np.array(document_embeddings)

document_embeddings = generate_embeddings(documents)
Explanation:
- embed.text() converts each batch of document texts into embeddings using the specified model (nomic-embed-text-v1).
- The embeddings are collected in the document_embeddings list and returned as a NumPy array, while the usages list keeps track of the API usage reported for each batch.
Expected Output: The output will be a NumPy array of document embeddings with a shape that corresponds to the number of documents and the dimensionality of the embeddings.
When to Use: This step is useful when you want to convert text data into machine-readable vectors for downstream tasks like clustering, similarity search, or training machine learning models.
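Once the embedding matrix is in memory, you can already run simple analyses locally before uploading anything. As a quick illustration (plain NumPy, assuming document_embeddings and documents from the snippets above), here is a cosine-similarity search for the articles closest to the first one:

import numpy as np

# Normalize rows so a dot product equals cosine similarity.
norms = np.linalg.norm(document_embeddings, axis=1, keepdims=True)
unit = document_embeddings / norms

# Similarity of every document to the first document in the subset.
scores = unit @ unit[0]

# The five most similar articles, skipping index 0 (the query itself).
for i in np.argsort(-scores)[1:6]:
    print(f"{scores[i]:.3f}  {documents[i]['text'][:80]}")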
4. Creating an Atlas Map for Visualizing Data
Once the embeddings are generated, you can visualize them using Nomic’s Atlas. Here’s how to create a map for the AG News dataset:
import pandas as pd
from nomic import atlas

# A 25k-article sample of AG News hosted by Nomic.
news_articles = pd.read_csv('https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv')

# Upload the data; Atlas embeds and indexes the 'text' column for you.
project = atlas.map_data(
    data=news_articles,
    indexed_field='text',
    identifier="Example-text-dataset-news"
)
Explanation:
- atlas.map_data() uploads the dataset to Nomic’s Atlas, which embeds the documents and builds an interactive map. Setting indexed_field='text' tells Atlas which column to embed and index for searching and visualization.
- The identifier parameter assigns a unique name to the dataset in Atlas.
- The returned dataset object is stored in project so we can query the map (for example, its topics) in the next step.
Expected Output:
- A successful upload of the data will result in a map being created in Nomic’s Atlas, which can be accessed via a link.
- The map visualizes the relationships between the documents and allows you to interactively explore them.
When to Use: This step is useful for visualizing and exploring relationships between high-dimensional data points, such as in the case of text embeddings. Use it when you need to analyze large datasets visually.
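If you would rather reuse the embeddings computed in step 3 instead of letting Atlas embed the text for you, the client can also accept a precomputed embedding matrix together with per-row metadata. The sketch below is a rough outline under that assumption (the identifier is a made-up name; check the map_data signature in your installed nomic version):

from nomic import atlas

# One metadata row per embedding; the order must match document_embeddings.
metadata = [{"text": doc["text"], "label": doc["label"]} for doc in documents]

project = atlas.map_data(
    embeddings=document_embeddings,                 # precomputed vectors from step 3
    data=metadata,                                  # fields shown alongside each point
    identifier="ag-news-precomputed-embeddings",    # hypothetical dataset name
)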
5. Topic Extraction from the Atlas Map
With the map created, you can extract and group topics from the dataset. This helps in identifying dominant themes in the collection of documents.
from pprint import pprint

# `project` is the dataset object returned by atlas.map_data() in the previous step.
with project.wait_for_dataset_lock():
    pprint(project.maps[0].topics.group_by_topic(topic_depth=1)[0])
Explanation:
- group_by_topic(topic_depth=1) groups the documents into topics at the requested depth of the topic hierarchy; a depth of 1 returns only the top-level (most general) topics.
- wait_for_dataset_lock() blocks until Atlas has finished building the map, so the topic model is ready before you query it.
Expected Output: A printed list of topics grouped by their most significant terms. The topics can be used to understand the major themes in your dataset.
When to Use: Topic modeling is useful when analyzing large datasets of unstructured text. It helps uncover hidden patterns and insights, making it ideal for exploratory analysis and content categorization.
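You can go a bit further than printing the first group, for example by counting how many top-level topics Atlas found and then drilling into a finer depth. A small sketch, again assuming project is the dataset object returned by atlas.map_data() (available depths depend on how Atlas modeled your data):

from pprint import pprint

with project.wait_for_dataset_lock():
    topics = project.maps[0].topics

    top_level = topics.group_by_topic(topic_depth=1)
    print(f"{len(top_level)} top-level topics")
    pprint(top_level[0])  # inspect the structure of a single topic group

    # Depth 2 returns finer-grained subtopics, if the topic model built them.
    second_level = topics.group_by_topic(topic_depth=2)
    print(f"{len(second_level)} second-level topics")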
6. Creating an Atlas Map for English-German Translations
In this step, we use the IWSLT 2014 English-German translation dataset to create another map for bilingual data.
dataset = load_dataset("bbaaaa/iwslt14-de-en", split="train") max_documents = 50_000 selected = dataset[:max_documents]["translation"] documents = [] for doc in selected: en_data = {"text": doc["en"], "en": doc["en"], "de": doc["de"], "language": "en"} de_data = {"text": doc["de"], "en": doc["en"], "de": doc["de"], "language": "de"} documents.append(en_data) documents.append(de_data) project = atlas.map_data(data=documents, indexed_field='text', identifier='English-German 50k Translations', description='50k Examples from the iwslt14-de-en dataset hosted on huggingface.', embedding_model='gte-multilingual-base', )
Explanation:
- The dataset contains English-German translation pairs, which are used to create a bilingual map.
- Each translation pair produces two documents: one whose text field holds the English sentence and one whose text field holds the German sentence, so both languages appear on the same map.
- gte-multilingual-base is specified as the embedding model so that text in both languages is embedded into a shared space.
Expected Output: Similar to the previous step, this will create a bilingual map of the dataset in the Nomic Atlas, which can be visualized interactively.
When to Use: This is particularly useful for multilingual data visualization, enabling insights into how different language pairs relate within a large corpus.
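Because each translation pair becomes two documents, the upload contains 100,000 rows for 50,000 pairs. A quick pure-Python check on the documents list built above confirms the structure before mapping:

from collections import Counter

print(len(documents))                                   # 100000 (two per pair)
print(Counter(doc["language"] for doc in documents))    # Counter({'en': 50000, 'de': 50000})

# Adjacent entries come from the same pair, so their source fields should match.
print(documents[0]["en"] == documents[1]["en"])          # True
print(documents[0]["de"] == documents[1]["de"])          # True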
Conclusion
In this blog, we’ve explored how to use Nomic to manage, process, and visualize large-scale datasets. From loading datasets and generating embeddings to visualizing complex relationships and extracting topics, Nomic provides a comprehensive toolkit for working with big data. By following this guide, you should now be able to set up and use Nomic to unlock powerful insights from your datasets.
Resources Section
---------------------------
Stay Updated: Follow Build Fast with AI pages for all the latest AI updates and resources.
Experts predict 2025 will be the defining year for Gen AI Implementation. Want to be ahead of the curve?
Join Build Fast with AI’s Gen AI Launch Pad 2025 - your accelerated path to mastering AI tools and building revolutionary applications.
---------------------------
Resources and Community
Join our community of 12,000+ AI enthusiasts and learn to build powerful AI applications! Whether you're a beginner or an experienced developer, this tutorial will help you understand and put these tools to work in your projects.
- Website: www.buildfastwithai.com
- LinkedIn: linkedin.com/company/build-fast-with-ai/
- Instagram: instagram.com/buildfastwithai/
- Twitter: x.com/satvikps
- Telegram: t.me/BuildFastWithAI