Methodology
How the Semantic Map of HuggingFace Datasets is built
Data as of April 2026
Overview #
This visualization maps the top 5,000 most-liked HuggingFace datasets onto a 2D plane, positioned by the semantic similarity of their dataset cards (the README associated with each repo). Datasets that describe similar content and purpose appear near each other, revealing natural clusters across the open-data ecosystem.
The map is generated by a pipeline that fetches dataset metadata and cards from the HuggingFace Hub, embeds the card text into high-dimensional vectors, reduces those vectors to two dimensions, applies hierarchical topic labeling, augments each card with LLM-extracted structured fields and short summaries, and renders an interactive HTML visualization.
Corpus Collection #
The pipeline uses a single-stage enumeration: a list_datasets call to the
HuggingFace Hub API requests a generous overshoot of 6,000 dataset entries, sorted by
the likes field. Each entry's metadata (id, author, likes, downloads,
language, license, size category, modality, task category, last_modified, created_at)
is captured in the same response. README cards are then downloaded individually via
hf_hub_download, with retry-with-backoff for 429/5xx responses.
Cards shorter than 200 characters after stripping YAML frontmatter and whitespace are
excluded. The remaining cards are sorted by likes and the top 5,000 are kept. The
enumeration step is deliberately not BigQuery-style: HF Hub's list_datasets
already returns a ranked, filterable result set, so a separate candidate pre-pass is
unnecessary. The likes ranking surfaces a community-curated, mostly NLP-leaning slice
of the Hub; ranking by downloads instead pulls in vision/robotics/pipeline-plumbing
data with median 0 likes (the top-1K overlap between the two rankings is only ~17%).
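For concreteness, a minimal sketch of the enumeration and download calls, using the huggingface_hub client named above (parallelism, backoff, and the parquet cache are elided; fetch_card is an illustrative helper, not the pipeline's actual function):

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Overshoot to 6,000 entries so the short-card filter can't
# shrink the corpus below the 5,000 target.
entries = api.list_datasets(sort="likes", limit=6000, full=True)

def fetch_card(repo_id: str) -> str:
    # The real pipeline runs this across 8 workers and retries
    # with exponential backoff on 429/5xx responses.
    path = hf_hub_download(repo_id=repo_id, filename="README.md",
                           repo_type="dataset")
    with open(path, encoding="utf-8") as f:
        return f.read()
```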
Each card is processed by Claude Haiku 4.5 in two independent passes. The first extracts eight structured fields against a fixed taxonomy (subject domain, training stage, format, provenance, etc.). The second writes a single-sentence (≤25-word) TL;DR summary. Both feed into hover tooltips, search, and colormap categories. The raw card text, not the summaries, is what gets embedded to drive map placement.
Processing Pipeline #
The pipeline runs as a sequence of steps, numbered to match the parameter tables below.

Step 0 (corpus collection). Calls HfApi.list_datasets(sort="likes", limit=6000, full=True), then downloads each card via hf_hub_download in parallel (8 concurrent workers, retry-with-exponential-backoff on 429/5xx). Strips YAML frontmatter, drops cards under 200 characters, sorts by likes, keeps the top 5,000. Resumable: rerunning reuses any cards already saved in data/datasets.parquet unless --refresh is passed.
Step 1 (embedding). Encodes each card (truncated to 4,000 characters) into a 512-dimensional vector using Cohere's embed-v4.0 model with input_type="clustering", in batches of 96.
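A minimal sketch of the embedding step, assuming Cohere's v2 Python client (embed_cards is an illustrative helper; the embeddings.float_ response field matches current SDK versions but may differ in older ones):

```python
import cohere

co = cohere.ClientV2()  # reads CO_API_KEY from the environment

def embed_cards(cards: list[str]) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(cards), 96):               # batches of 96
        batch = [c[:4000] for c in cards[i:i + 96]]  # 4,000-char truncation
        res = co.embed(
            model="embed-v4.0",
            input_type="clustering",
            texts=batch,
            embedding_types=["float"],
            output_dimension=512,                    # 512-dim vectors
        )
        vectors.extend(res.embeddings.float_)
    return vectors
```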
Step 2 (dimensionality reduction). Applies UMAP (n_neighbors=15, min_dist=0.05, metric="cosine", random_state=42) to project the 512-dimensional embeddings down to 2D coordinates for map placement.
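This step maps directly onto the UMAP library's standard API; with the parameters above it amounts to:

```python
import umap

reducer = umap.UMAP(
    n_components=2,    # 2D coordinates for map placement
    n_neighbors=15,    # local neighborhood size
    min_dist=0.05,     # tightness of clusters in 2D
    metric="cosine",
    random_state=42,   # reproducible layout
)
coords = reducer.fit_transform(vectors)  # shape: (n_cards, 2)
```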
Step 3 (topic labeling). Uses the Toponymy library for hierarchical density-based clustering (min_clusters=4, lowest_detail_level=0.5, highest_detail_level=1.0), then sends representative documents from each cluster to Claude Sonnet 4 to generate human-readable topic labels at multiple levels of detail. Documents passed to the labeler are composites of pretty_name, repo_id, selected metadata tags, and a card excerpt (up to 2,000 characters), as sketched below.
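A sketch of how such a composite document could be assembled (labeler_doc is a hypothetical helper illustrating the composition described above, not the pipeline's actual code):

```python
def labeler_doc(meta: dict, card_text: str) -> str:
    # Composite passed to the topic labeler: pretty_name, repo_id,
    # selected metadata tags, and a card excerpt of up to 2,000 chars.
    parts = [
        meta.get("pretty_name", ""),
        meta["repo_id"],
        " ".join(meta.get("tags", [])),
        card_text[:2000],
    ]
    return "\n".join(p for p in parts if p)
```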
Step 4 (structured-field extraction). Sends each card to Claude Haiku 4.5 with a system prompt assembled from pipeline/taxonomy.json. Eight fields are extracted per card: provenance_method, subject_domain, training_stage, format_convention, special_characteristics, geo_scope, upstream_models, and is_benchmark. Each value is paired with a short verbatim quote from the card. Resumable via per-repo JSON files in data/structured_fields_cache/.
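A sketch of the per-card extraction call, assuming the Anthropic Python SDK's messages API (extract_fields and the exact user-message layout are illustrative; the cache_control tag matches the ephemeral prompt caching noted in the parameters table):

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def extract_fields(repo_id: str, card_text: str, system_prompt: str) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,                   # built from taxonomy.json
            "cache_control": {"type": "ephemeral"},  # prompt caching
        }],
        messages=[{
            "role": "user",
            "content": f"Dataset: {repo_id}\n\n{card_text[:6000]}",
        }],
    )
    # One {"value": ..., "quote": ...} entry per field.
    return json.loads(msg.content[0].text)
```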
Step 4b (card summarization). Sends each card to Claude Haiku 4.5 to produce a single-sentence (≤25-word) TL;DR summary. Independent of step 4 (different prompt, different output, different cache directory). Resumable via per-repo JSON files in data/summaries_cache/.
Step 5 (rendering). Combines coordinates, topic labels, structured fields, summaries, and HF metadata into an interactive HTML map using DataMapPlot, with multiple colormaps, search, hover tooltips, click-to-open functionality, and an injected advanced-filters panel.
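A sketch of the rendering call, assuming DataMapPlot's interactive API (colormap wiring and the injected filter panel are elided; the hover text and on_click template here are illustrative):

```python
import datamapplot

fig = datamapplot.create_interactive_plot(
    coords,                  # (n, 2) UMAP coordinates
    *topic_label_layers,     # one label array per Toponymy detail level
    hover_text=dataset_ids,  # per-point tooltip anchor
    enable_search=True,
    on_click="window.open(`https://huggingface.co/datasets/{hover_text}`)",
)
fig.save("index.html")
```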
Tools & Technologies #
| Tool | Role |
|---|---|
| HuggingFace Hub API | Dataset enumeration and card download |
| Cohere embed-v4.0 | Card text embedding (512 dimensions) |
| UMAP | Dimensionality reduction from 512D to 2D |
| Toponymy | Hierarchical density-based topic labeling |
| DataMapPlot | Interactive HTML map rendering |
| Claude Sonnet 4 | Topic label generation (inside Toponymy) |
| Claude Haiku 4.5 | Structured-field extraction and TL;DR summarization |
Notable Parameters #
Key parameter values used across the pipeline. These are the authoritative reference; some also appear inline in the step descriptions above.
| Parameter | Value | Notes |
|---|---|---|
| Corpus (step 0) | ||
| Rank signal | likes | Single field passed to HfApi.list_datasets(sort=...) |
| Target count | 5,000 | Final corpus size after filtering |
| Fetch overshoot | 6,000 | Ask for more so short-card filtering doesn't shrink below target |
| Min card length | 200 chars | After YAML stripping; shorter cards excluded |
| Card truncation | 4,000 chars | For embedding and storage |
| Embedding (step 1) | ||
| Model | Cohere embed-v4.0 | embed-v4.0, input_type="clustering" |
| Dimensions | 512 | Output vector size |
| Batch size | 96 | Cards per API call |
| UMAP dimensionality reduction (step 2) | ||
| n_neighbors | 15 | Local neighborhood size |
| min_dist | 0.05 | Controls tightness of clusters in 2D |
| Metric | cosine | Distance metric |
| random_state | 42 | For reproducibility |
| Topic labeling (step 3) | ||
| Model | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | lowest_detail_level to highest_detail_level |
| Object description | "HuggingFace dataset cards" | Passed to Toponymy LLM wrapper |
| Corpus description | "collection of the top 5,000 HuggingFace datasets ranked by likes" | Passed to Toponymy LLM wrapper |
| Structured-field extraction (step 4) | ||
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 6,000 chars | Per call, before sending to the model |
| Concurrency | 12 | Async semaphore |
| Cache control | ephemeral | System prompt is cache-tagged for prompt caching |
| Card summarization (step 4b) | ||
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 4,000 chars | Shorter than the extraction pass, since TL;DRs mostly come from the card's opening |
| Max words | 25 | Hard budget enforced by prompt |
| Concurrency | 12 | Async semaphore |
LLM Prompts #
Structured-field extraction (step 4)
The system prompt is assembled at runtime from pipeline/taxonomy.json,
which defines eight fields and their allowed slugs. For each field the prompt enumerates
the slugs and their descriptions; the user message contains the dataset id and the
truncated card body. The model returns a JSON object with one entry per field, where
each entry has both a value (a slug, list of slugs, or boolean) and a
short verbatim quote from the card that justified the choice. The
scaffold below shows the system-prompt structure; field definitions are sourced from
pipeline/taxonomy.json in the repo.
You extract constrained structured metadata from HuggingFace dataset cards.
RULES:
- For slug-valued fields, return one of the provided slugs verbatim. No paraphrases, no combined values like 'a / b'.
- For LIST-typed fields, the `value` MUST be a JSON array even if only one item applies: `["item"]`. Never a bare string.
- Each field captures a DIFFERENT axis. Do NOT reuse a slug from one field as the value for another field. Axis definitions: `subject_domain` = what the data is ABOUT (noun); `provenance_method` = HOW the data was created; `training_stage` = what the data is FOR in the training stack; `format_convention` = the SCHEMA SHAPE of each record; `special_characteristics` = orthogonal PROPERTIES (long-context, roleplay, multilingual-parallel, reasoning-traces, etc.). […]
- For each field, include `quote`: a ≤25-word verbatim span from the card that justified your choice. If silent, use the sentinel 'not_stated' for the quote.
- Output strictly valid JSON. No prose, no markdown fences, no commentary outside the JSON object.
FIELD DEFINITIONS:
FIELD: provenance_method (pick EXACTLY ONE slug)
- human-created: Primary-source content written, recorded, or annotated by humans …
- web-scraped: Harvested from public web sources …
- llm-generated: Content produced by an LLM or generative model …
[… 8 more slugs …]
FIELD: subject_domain (pick EXACTLY ONE slug)
- general-web-text, instruction-and-chat, code-and-software, math-and-reasoning,
scientific-research, medical-and-biomedical, natural-images-and-video,
generated-media, speech-and-audio, legal-and-policy, finance-and-business,
safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use,
entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated
FIELD: training_stage (return a JSON LIST of applicable slugs …)
- pretraining, sft, preference, eval, domain-finetune, raw-corpus,
not_applicable, other, not_stated
FIELD: format_convention (pick EXACTLY ONE slug)
- sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption,
audio-transcript, raw-text, tabular, structured-record, other, not_stated
FIELD: special_characteristics (return a JSON LIST …)
- long-context, multi-turn, adversarial, reasoning-traces, low-resource-language,
multilingual-parallel, roleplay, long-form-generation
FIELD: geo_scope (return a JSON LIST)
RULE: country/region names if explicit; ['global'] only if explicit;
['not_applicable'] for content with no meaningful geography;
['not_stated'] is the default if silent.
FIELD: upstream_models (return a JSON LIST of raw strings)
RULE: Models explicitly mentioned as having generated this data
(e.g. 'GPT-4', 'Claude 3.5 Sonnet'). ['not_applicable'] / ['not_stated'] otherwise.
FIELD: is_benchmark (return true or false)
RULE: true only if the dataset is explicitly intended as an evaluation benchmark.
OUTPUT SHAPE (fill in values; do not change the structure):
{
"provenance_method": { "value": <per field rule>, "quote": "<≤25-word span or sentinel>" },
"subject_domain": { "value": … },
…
}
Card summarization (step 4b)
Each card (truncated to 4,000 characters) is sent to Claude Haiku 4.5 with the system
prompt below. The model returns a JSON object with a single summary field
containing one sentence of at most 25 words.
You write short, specific TL;DR summaries of HuggingFace datasets.
Your summary MUST:
- Be a single sentence of ≤25 words.
- Describe what the dataset IS directly, using a specific noun phrase as the opening.
- Mention what makes it distinctive: origin, scale, methodology, unique property, or source.
- Be self-contained — a reader should understand the dataset without any other context.
Your summary MUST NOT:
- Start with "This dataset…", "A dataset of…", "This is…", or similar filler openings.
- Exceed 25 words.
- Be generic (e.g. "A text classification dataset" is useless — say what it classifies and why it's interesting).
- Include marketing prose or hedging ("powerful", "comprehensive", "may be useful for…").
Good example summaries:
- "12 million YouTube Music track links auto-discovered by recursively walking 'fans might also like' suggestions from a seed of 45,000 artists."
- "Japanese translation of LLaVA-Instruct-150K via DeepL, for Japanese vision-language instruction tuning."
- "One million anonymized real-estate listings from Divar, Iran's largest classifieds platform, with 57 columns of price and location detail."
- "31 English short-answer questions on communication networks, with reference answers and scored student responses for feedback-generation training."
Output strictly valid JSON: {"summary": "<your sentence>"}. No prose, no markdown fences, nothing outside the JSON.
Topic labeling (step 3)
Topic labels are generated by the Toponymy
library, which constructs its own LLM prompts internally. The pipeline configures Toponymy with
values that shape those prompts: the object_description is set to
"HuggingFace dataset cards" and the corpus_description to "collection of
the top 5,000 HuggingFace datasets ranked by likes". Exemplar documents are delimited with
triple-quoted Python docstring markers ("""…""").
Toponymy generates labels at detail levels between 0.5 and 1.0, with the number of
hierarchical layers determined automatically by the clustering algorithm
(min_clusters=4). Claude Sonnet 4 serves as the LLM backend for generating
human-readable topic names.
Using the Visualization #
- Pan and zoom: Click and drag to pan; scroll to zoom in and out.
- Hover: Hover over any point to see the dataset id, an LLM-written TL;DR summary, popularity stats (likes, downloads, size), the LLM-extracted subject pill, and a 2-column grid of metadata (role, task, training stage, modality, language, provenance, format) with license and last-modified date in the footer.
- Click: Click any point to open the dataset on huggingface.co in a new tab.
- Search: Use the search box to find specific datasets by id, summary text, or any of the bucketed taxonomy fields.
- Filter panel: Open the sidebar to narrow the visible points with range sliders (likes, downloads, created year, days since modified) and checkbox filters (task category, modality, license, size, language, subject domain, provenance, training stage, format convention, role).
- Colormaps: Switch between colormaps using the dropdown to color points by HuggingFace-reported metadata (task, modality, license, size, language, likes, downloads), Toponymy topic layers, or LLM-extracted fields (subject domain, provenance, training stage, format convention, role).
- Topic labels: Cluster labels appear on the map at multiple levels of detail, from broad categories down to specific sub-topics. The number of levels is chosen automatically by the labeling algorithm.
Field Definitions #
| Colormap | Description |
|---|---|
| Task Category | The first task_categories tag the dataset card declares (e.g. text-classification, question-answering). The top 9 are shown individually; all others are grouped as "Other". |
| Modality | The first modalities tag the dataset card declares (text, image, audio, video, etc.). The top 9 are shown individually; all others are grouped as "Other". |
| License | Licenses grouped into families: MIT, Apache, GPL Family, BSD, MPL, CC BY, CC ShareAlike, CC NonCommercial, CC0 / Public Domain, CC (other), ODC / ODbL, OpenRAIL, Llama License, Gemma License, Other Permissive, Other. |
| Size Category | An ordinal bucket derived from the dataset's size_categories tag: <1K, 1K–10K, 10K–100K, 100K–1M, 1M–10M, 10M–100M, 100M–1B, 1B–10B, 10B–100B, 100B–1T, >1T, Unknown. |
| Language | A single label per dataset: Multilingual if multiple languages or the multilinguality tag says so; Translation for parallel/translation corpora; otherwise the human-readable name of the single declared language. The top 9 are shown individually; all others are grouped as "Other". |
| Likes (log10) | Base-10 logarithm of the dataset's HuggingFace like count. Log scale spreads out the long tail of low-likes datasets while preserving differences among the most-liked. |
| Downloads (log10) | Base-10 logarithm of the dataset's all-time download count. |
| Subject Domain (LLM) | What the data is ABOUT, classified by Claude Haiku from a fixed taxonomy. Values include general-web-text, instruction-and-chat, code-and-software, math-and-reasoning, scientific-research, medical-and-biomedical, natural-images-and-video, generated-media, speech-and-audio, legal-and-policy, finance-and-business, safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use, entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated. |
| Provenance (LLM) | HOW the data was created. Values: human-created, web-scraped, sensor-recorded, llm-generated, translated, filtered-subset, remix, algorithmically-derived, mixed, other, not_stated. |
| Training Stage (LLM) | The dataset's role in the training stack. A primary stage is picked per dataset using a priority order (preference > sft > eval > domain-finetune > pretraining > raw-corpus); see the sketch below the table. Underlying multi-select values are preserved in the parquet and shown as a comma-joined list in hover. |
| Format Convention (LLM) | The schema shape of each record. Values: sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption, audio-transcript, raw-text, tabular, structured-record, other, not_stated. |
| Role: Benchmark vs Training (LLM) | Derived from the is_benchmark boolean: Benchmark, Training data, or Unknown. |
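The Training Stage colormap's primary-stage pick is a simple priority scan; a minimal sketch (primary_stage is an illustrative helper, not the pipeline's actual code):

```python
PRIORITY = ["preference", "sft", "eval", "domain-finetune",
            "pretraining", "raw-corpus"]

def primary_stage(stages: list[str]) -> str:
    # Return the highest-priority stage present in the multi-select list.
    for stage in PRIORITY:
        if stage in stages:
            return stage
    return "Unknown"
```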
Prior Work #
A close prior example is the ArXiv Data Map, which uses the same embed–reduce–label–render pipeline on 2.4 million ArXiv papers. The pipeline structure here is the same; the differences are the corpus (HF dataset cards instead of ArXiv abstracts), an additional LLM-extraction stage that produces structured fields against a fixed taxonomy, and per-card LLM-written TL;DR summaries.
More broadly, this project sits in a tradition of spatially mapping knowledge domains. Katy Börner's UCSD Map of Science, described in Atlas of Science, mapped 7.2 million scientific publications across 554 subdisciplines using bibliometric citation networks — an early demonstration that large document corpora can be organized usefully in two dimensions.
HuggingFace's own dataset hub UI is the natural baseline for browsing datasets, but it is a list-and-filter interface, not a spatial one. Two datasets that are conceptually neighbors (say, two Japanese-translation instruction sets) might appear pages apart there while sharing the same patch on this map. The map is meant as a complement, not a replacement.
A sister project applies the same pipeline to the top 10,000 most-starred GitHub repositories, with a different LLM-extraction taxonomy (project type, target audience) suited to source repositories rather than datasets.
Acknowledgements #
Built on open-source tools from the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering. HuggingFace provides the underlying dataset metadata and card content through the Hub API.
Limitations #
This is a map of HuggingFace datasets, not the map. As with any map, it is not the territory: it's a projection shaped by specific choices about what to measure, how to measure it, and how to render the result. As George Box put it, all models are wrong, but some are useful. I think this map is useful, but it's worth understanding where it falls short.
- Embedding model dependence. The embedding model is the single biggest determinant of what "similarity" means on this map. Every downstream step (UMAP layout, clustering, topic labels) inherits the geometry it defines. Cohere's embed-v4.0 was trained on general web and document data, so it captures natural-language similarity well but may underweight signals a dataset author would consider important: the actual schema of the records, embedded JSON examples, or the statistical distribution of labels. A different embedding model, especially a multimodal or schema-aware one, could produce a meaningfully different map from the same inputs.
- Selection bias from the likes signal. The map includes only the top 5,000 datasets by HuggingFace likes. Likes correlate with community visibility and English-language documentation, so the corpus systematically over-represents community-curated, NLP-leaning datasets and under-represents domain-specific or freshly-uploaded ones. Ranking by downloads instead would surface a different population (more vision/robotics/pipeline-plumbing data with median 0 likes); the top-1K overlap between the two rankings is only ~17%.
- Card-driven placement. Datasets are positioned by what their card says, not by what the data contains. A dataset with a sparse or misleading card will be misplaced, and cards shorter than 200 characters are excluded entirely. Cards are also truncated to 4,000 characters before embedding, so content deeper in the file has no influence on placement.
- Dimensionality reduction artifacts. UMAP does a good job preserving both local and global structure, but projecting 512 dimensions onto 2 inevitably loses information. Nearby points on the map genuinely have similar cards, and the relative positions of clusters are broadly meaningful, but inter-cluster distances should not be read as precise measurements of dissimilarity.
- LLM-generated content. The TL;DR summaries (Claude Haiku), structured-field extractions (Claude Haiku), and topic labels (Claude Sonnet via Toponymy) are all generated without human review. They can oversimplify, miscategorize, or occasionally hallucinate details that aren't in the source card. The structured fields are constrained to a fixed taxonomy that itself reflects design choices about which axes to surface; some real distinctions (e.g. specific dataset families, particular author lineages) are not captured by any field.
- HuggingFace metadata is uneven. Tags such as size_categories, language, license, and task_categories are author-supplied and often missing, ambiguous, or inconsistently formatted. The colormap aggregations (e.g. license families, language buckets) absorb a lot of this messiness, but a card with no tags will fall into "Unknown" or "Other" buckets even when the data itself is clearly classifiable.
- Point-in-time snapshot. The map reflects a single moment. Likes, downloads, card content, and even dataset existence all change over time. There's no mechanism for incremental updates; regenerating the map requires rerunning the pipeline.
- Proprietary tooling. There is some irony in using proprietary, closed-source tools (Cohere's embedding model and Anthropic's LLMs) to build a map of notable open-data projects. These choices were made for quality and convenience, but they mean the pipeline can't be fully reproduced without access to commercial APIs, and the embedding and labeling behavior is opaque.
About #
- Author: Steven Fazzio
- Source code: stevenfazzio/huggingface-dataset-map
- Sister project: stevenfazzio/semantic-github-map — same pipeline, applied to the top 10,000 most-starred GitHub repositories
- Feedback: Bug reports and feature requests are welcome. Please open an issue.
- License: MIT License