Methodology
How the Semantic Map of HuggingFace Datasets is built
Data as of April 2026
Overview #
This visualization maps the top 5,000 most-liked HuggingFace datasets onto a 2D plane, positioned by the semantic similarity of their dataset cards (the README associated with each repo). Datasets that describe similar content and purpose appear near each other, revealing natural clusters across the open-data ecosystem.
The map is generated by a pipeline that fetches dataset metadata and cards from the HuggingFace Hub, embeds the card text into high-dimensional vectors, reduces those vectors to two dimensions, applies hierarchical topic labeling, augments each card with LLM-extracted structured fields and short summaries, and renders an interactive HTML visualization.
Corpus Collection #
The pipeline uses a single-stage enumeration: a list_datasets call to the
HuggingFace Hub API requests a generous overshoot of 6,000 dataset entries, sorted by
the likes field. Each entry's metadata (id, author, likes, downloads,
language, license, size category, modality, task category, last_modified, created_at)
is captured in the same response. README cards are then downloaded individually via
hf_hub_download, with retry-with-backoff for 429/5xx responses.
Cards shorter than 200 characters after stripping YAML frontmatter and whitespace are
excluded. The remaining cards are sorted by likes and the top 5,000 are kept. The
enumeration step is deliberately not BigQuery-style: HF Hub's list_datasets
already returns a ranked, filterable result set, so a separate candidate pre-pass is
unnecessary. The likes ranking surfaces a community-curated, mostly NLP-leaning slice
of the Hub; ranking by downloads instead pulls in vision/robotics/pipeline-plumbing
data with median 0 likes (the top-1K overlap between the two rankings is only ~17%).
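For concreteness, a minimal sketch of the enumeration and download calls, using the huggingface_hub client named above (parallelism, backoff, and the parquet cache are elided; fetch_card is an illustrative helper, not the pipeline's actual function):

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Overshoot to 6,000 entries so the short-card filter can't
# shrink the corpus below the 5,000 target.
entries = api.list_datasets(sort="likes", limit=6000, full=True)

def fetch_card(repo_id: str) -> str:
    # The real pipeline runs this across 8 workers and retries
    # with exponential backoff on 429/5xx responses.
    path = hf_hub_download(repo_id=repo_id, filename="README.md",
                           repo_type="dataset")
    with open(path, encoding="utf-8") as f:
        return f.read()
```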
Each card is processed by Claude Haiku 4.5 in two independent passes. The first extracts eight structured fields against a fixed taxonomy (subject domain, training stage, format, provenance, etc.). The second writes a single-sentence (≤25-word) TL;DR summary. Both feed into hover tooltips, search, and colormap categories. The raw card text, not the summaries, is what gets embedded to drive map placement.
Processing Pipeline #
The pipeline runs as a sequence of steps, numbered to match the parameter tables below.

Step 0 (corpus collection). Calls HfApi.list_datasets(sort="likes", limit=6000, full=True), then downloads each card via hf_hub_download in parallel (8 concurrent workers, retry-with-exponential-backoff on 429/5xx). Strips YAML frontmatter, drops cards under 200 characters, sorts by likes, keeps the top 5,000. Resumable: rerunning reuses any cards already saved in data/datasets.parquet unless --refresh is passed.
Step 1 (embedding). Encodes each card (truncated to 4,000 characters) into a 512-dimensional vector using Cohere's embed-v4.0 model with input_type="clustering", in batches of 96.
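A minimal sketch of the embedding step, assuming Cohere's v2 Python client (embed_cards is an illustrative helper; the embeddings.float_ response field matches current SDK versions but may differ in older ones):

```python
import cohere

co = cohere.ClientV2()  # reads CO_API_KEY from the environment

def embed_cards(cards: list[str]) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(cards), 96):               # batches of 96
        batch = [c[:4000] for c in cards[i:i + 96]]  # 4,000-char truncation
        res = co.embed(
            model="embed-v4.0",
            input_type="clustering",
            texts=batch,
            embedding_types=["float"],
            output_dimension=512,                    # 512-dim vectors
        )
        vectors.extend(res.embeddings.float_)
    return vectors
```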
Step 2 (dimensionality reduction). Applies UMAP (n_neighbors=15, min_dist=0.05, metric="cosine", random_state=42) to project the 512-dimensional embeddings down to 2D coordinates for map placement.
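This step maps directly onto the UMAP library's standard API; with the parameters above it amounts to:

```python
import umap

reducer = umap.UMAP(
    n_components=2,    # 2D coordinates for map placement
    n_neighbors=15,    # local neighborhood size
    min_dist=0.05,     # tightness of clusters in 2D
    metric="cosine",
    random_state=42,   # reproducible layout
)
coords = reducer.fit_transform(vectors)  # shape: (n_cards, 2)
```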
Step 3 (topic labeling). Uses the Toponymy library for hierarchical density-based clustering (min_clusters=4, lowest_detail_level=0.5, highest_detail_level=1.0), then sends representative documents from each cluster to Claude Sonnet 4 to generate human-readable topic labels at multiple levels of detail. Documents passed to the labeler are composites of pretty_name, repo_id, selected metadata tags, and a card excerpt (up to 2,000 characters), as sketched below.
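A sketch of how such a composite document could be assembled (labeler_doc is a hypothetical helper illustrating the composition described above, not the pipeline's actual code):

```python
def labeler_doc(meta: dict, card_text: str) -> str:
    # Composite passed to the topic labeler: pretty_name, repo_id,
    # selected metadata tags, and a card excerpt of up to 2,000 chars.
    parts = [
        meta.get("pretty_name", ""),
        meta["repo_id"],
        " ".join(meta.get("tags", [])),
        card_text[:2000],
    ]
    return "\n".join(p for p in parts if p)
```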
Step 4 (structured-field extraction). Sends each card to Claude Haiku 4.5 with a system prompt assembled from pipeline/taxonomy.json. Eight fields are extracted per card: provenance_method, subject_domain, training_stage, format_convention, special_characteristics, geo_scope, upstream_models, and is_benchmark. Each value is paired with a short verbatim quote from the card. Resumable via per-repo JSON files in data/structured_fields_cache/.
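A sketch of the per-card extraction call, assuming the Anthropic Python SDK's messages API (extract_fields and the exact user-message layout are illustrative; the cache_control tag matches the ephemeral prompt caching noted in the parameters table):

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def extract_fields(repo_id: str, card_text: str, system_prompt: str) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,                   # built from taxonomy.json
            "cache_control": {"type": "ephemeral"},  # prompt caching
        }],
        messages=[{
            "role": "user",
            "content": f"Dataset: {repo_id}\n\n{card_text[:6000]}",
        }],
    )
    # One {"value": ..., "quote": ...} entry per field.
    return json.loads(msg.content[0].text)
```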
Step 4b (card summarization). Sends each card to Claude Haiku 4.5 to produce a single-sentence (≤25-word) TL;DR summary. Independent of step 4 (different prompt, different output, different cache directory). Resumable via per-repo JSON files in data/summaries_cache/.
Step 5 (rendering). Combines coordinates, topic labels, structured fields, summaries, and HF metadata into an interactive HTML map using DataMapPlot, with multiple colormaps, search, hover tooltips, click-to-open functionality, and an injected advanced-filters panel.
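A sketch of the rendering call, assuming DataMapPlot's interactive API (colormap wiring and the injected filter panel are elided; the hover text and on_click template here are illustrative):

```python
import datamapplot

fig = datamapplot.create_interactive_plot(
    coords,                  # (n, 2) UMAP coordinates
    *topic_label_layers,     # one label array per Toponymy detail level
    hover_text=dataset_ids,  # per-point tooltip anchor
    enable_search=True,
    on_click="window.open(`https://huggingface.co/datasets/{hover_text}`)",
)
fig.save("index.html")
```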
Tools & Technologies #
| Tool | Role |
|---|---|
| HuggingFace Hub API | Dataset enumeration and card download |
| Cohere embed-v4.0 | Card text embedding (512 dimensions) |
| UMAP | Dimensionality reduction from 512D to 2D |
| Toponymy | Hierarchical density-based topic labeling |
| DataMapPlot | Interactive HTML map rendering |
| Claude Sonnet 4 | Topic label generation (inside Toponymy) |
| Claude Haiku 4.5 | Structured-field extraction and TL;DR summarization |
Notable Parameters #
Key parameter values used across the pipeline. These are the authoritative reference; some also appear inline in the step descriptions above.
| Parameter | Value | Notes |
|---|---|---|
| Corpus (step 0) | ||
| Rank signal | likes | Single field passed to HfApi.list_datasets(sort=...) |
| Target count | 5,000 | Final corpus size after filtering |
| Fetch overshoot | 6,000 | Ask for more so short-card filtering doesn't shrink below target |
| Min card length | 200 chars | After YAML stripping; shorter cards excluded |
| Card truncation | 4,000 chars | For embedding and storage |
| Embedding (step 1) | ||
| Model | Cohere embed-v4.0 | embed-v4.0, input_type="clustering" |
| Dimensions | 512 | Output vector size |
| Batch size | 96 | Cards per API call |
| UMAP dimensionality reduction (step 2) | ||
| n_neighbors | 15 | Local neighborhood size |
| min_dist | 0.05 | Controls tightness of clusters in 2D |
| Metric | cosine | Distance metric |
| random_state | 42 | For reproducibility |
| Topic labeling (step 3) | ||
| Model | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | lowest_detail_level to highest_detail_level |
| Object description | "HuggingFace dataset cards" | Passed to Toponymy LLM wrapper |
| Corpus description | "collection of the top 5,000 HuggingFace datasets ranked by likes" | Passed to Toponymy LLM wrapper |
| Structured-field extraction (step 4) | ||
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 6,000 chars | Per call, before sending to the model |
| Concurrency | 12 | Async semaphore |
| Cache control | ephemeral | System prompt is cache-tagged for prompt caching |
| Card summarization (step 4b) | ||
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 4,000 chars | Shorter than the extraction pass, since TL;DRs mostly come from the card's opening |
| Max words | 25 | Hard budget enforced by prompt |
| Concurrency | 12 | Async semaphore |
LLM Prompts #
Structured-field extraction (step 4)
The system prompt is assembled at runtime from pipeline/taxonomy.json,
which defines eight fields and their allowed slugs. For each field the prompt enumerates
the slugs and their descriptions; the user message contains the dataset id and the
truncated card body. The model returns a JSON object with one entry per field, where
each entry has both a value (a slug, list of slugs, or boolean) and a
short verbatim quote from the card that justified the choice. The
scaffold below shows the system-prompt structure; field definitions are sourced from
pipeline/taxonomy.json in the repo.
You extract constrained structured metadata from HuggingFace dataset cards.
RULES:
- For slug-valued fields, return one of the provided slugs verbatim. No paraphrases, no combined values like 'a / b'.
- For LIST-typed fields, the `value` MUST be a JSON array even if only one item applies: `["item"]`. Never a bare string.
- Each field captures a DIFFERENT axis. Do NOT reuse a slug from one field as the value for another field. Axis definitions: `subject_domain` = what the data is ABOUT (noun); `provenance_method` = HOW the data was created; `training_stage` = what the data is FOR in the training stack; `format_convention` = the SCHEMA SHAPE of each record; `special_characteristics` = orthogonal PROPERTIES (long-context, roleplay, multilingual-parallel, reasoning-traces, etc.). […]
- For each field, include `quote`: a ≤25-word verbatim span from the card that justified your choice. If silent, use the sentinel 'not_stated' for the quote.
- Output strictly valid JSON. No prose, no markdown fences, no commentary outside the JSON object.
FIELD DEFINITIONS:
FIELD: provenance_method (pick EXACTLY ONE slug)
- human-created: Primary-source content written, recorded, or annotated by humans …
- web-scraped: Harvested from public web sources …
- llm-generated: Content produced by an LLM or generative model …
[… 8 more slugs …]
FIELD: subject_domain (pick EXACTLY ONE slug)
- general-web-text, instruction-and-chat, code-and-software, math-and-reasoning,
scientific-research, medical-and-biomedical, natural-images-and-video,
generated-media, speech-and-audio, legal-and-policy, finance-and-business,
safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use,
entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated
FIELD: training_stage (return a JSON LIST of applicable slugs …)
- pretraining, sft, preference, eval, domain-finetune, raw-corpus,
not_applicable, other, not_stated
FIELD: format_convention (pick EXACTLY ONE slug)
- sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption,
audio-transcript, raw-text, tabular, structured-record, other, not_stated
FIELD: special_characteristics (return a JSON LIST …)
- long-context, multi-turn, adversarial, reasoning-traces, low-resource-language,
multilingual-parallel, roleplay, long-form-generation
FIELD: geo_scope (return a JSON LIST)
RULE: country/region names if explicit; ['global'] only if explicit;
['not_applicable'] for content with no meaningful geography;
['not_stated'] is the default if silent.
FIELD: upstream_models (return a JSON LIST of raw strings)
RULE: Models explicitly mentioned as having generated this data
(e.g. 'GPT-4', 'Claude 3.5 Sonnet'). ['not_applicable'] / ['not_stated'] otherwise.
FIELD: is_benchmark (return true or false)
RULE: true only if the dataset is explicitly intended as an evaluation benchmark.
OUTPUT SHAPE (fill in values; do not change the structure):
{
"provenance_method": { "value": <per field rule>, "quote": "<≤25-word span or sentinel>" },
"subject_domain": { "value": … },
…
}
Card summarization (step 4b)
Each card (truncated to 4,000 characters) is sent to Claude Haiku 4.5 with the system
prompt below. The model returns a JSON object with a single summary field
containing one sentence of at most 25 words.
You write short, specific TL;DR summaries of HuggingFace datasets.
Your summary MUST:
- Be a single sentence of ≤25 words.
- Describe what the dataset IS directly, using a specific noun phrase as the opening.
- Mention what makes it distinctive: origin, scale, methodology, unique property, or source.
- Be self-contained — a reader should understand the dataset without any other context.
Your summary MUST NOT:
- Start with "This dataset…", "A dataset of…", "This is…", or similar filler openings.
- Exceed 25 words.
- Be generic (e.g. "A text classification dataset" is useless — say what it classifies and why it's interesting).
- Include marketing prose or hedging ("powerful", "comprehensive", "may be useful for…").
Good example summaries:
- "12 million YouTube Music track links auto-discovered by recursively walking 'fans might also like' suggestions from a seed of 45,000 artists."
- "Japanese translation of LLaVA-Instruct-150K via DeepL, for Japanese vision-language instruction tuning."
- "One million anonymized real-estate listings from Divar, Iran's largest classifieds platform, with 57 columns of price and location detail."
- "31 English short-answer questions on communication networks, with reference answers and scored student responses for feedback-generation training."
Output strictly valid JSON: {"summary": "<your sentence>"}. No prose, no markdown fences, nothing outside the JSON.
Topic labeling (step 3)
Topic labels are generated by the Toponymy
library, which constructs its own LLM prompts internally. The pipeline configures Toponymy with
values that shape those prompts: the object_description is set to
"HuggingFace dataset cards" and the corpus_description to "collection of
the top 5,000 HuggingFace datasets ranked by likes". Exemplar documents are delimited with
triple-quoted Python docstring markers ("""…""").
Toponymy generates labels at detail levels between 0.5 and 1.0, with the number of
hierarchical layers determined automatically by the clustering algorithm
(min_clusters=4). Claude Sonnet 4 serves as the LLM backend for generating
human-readable topic names.
Using the Visualization #
- Pan and zoom: Click and drag to pan; scroll to zoom in and out.
- Hover: Hover over any point to see the dataset id, an LLM-written TL;DR summary, popularity stats (likes, downloads, size), the LLM-extracted subject pill, and a 2-column grid of metadata (role, task, training stage, modality, language, provenance, format) with license and last-modified date in the footer.
- Click: Click any point to open the dataset on huggingface.co in a new tab.
- Search: Use the search box to find specific datasets by id, summary text, or any of the bucketed taxonomy fields.
- Filter panel: Open the sidebar to narrow the visible points with range sliders (likes, downloads, created year, days since modified) and checkbox filters (task category, modality, license, size, language, subject domain, provenance, training stage, format convention, role).
- Colormaps: Switch between colormaps using the dropdown to color points by HuggingFace-reported metadata (task, modality, license, size, language, likes, downloads), Toponymy topic layers, or LLM-extracted fields (subject domain, provenance, training stage, format convention, role).
- Topic labels: Cluster labels appear on the map at multiple levels of detail, from broad categories down to specific sub-topics. The number of levels is chosen automatically by the labeling algorithm.
Field Definitions #
| Colormap | Description |
|---|---|
| Task Category | The first task_categories tag the dataset card declares (e.g. text-classification, question-answering). The top 9 are shown individually; all others are grouped as "Other". |
| Modality | The first modalities tag the dataset card declares (text, image, audio, video, etc.). The top 9 are shown individually; all others are grouped as "Other". |
| License | Licenses grouped into families: MIT, Apache, GPL Family, BSD, MPL, CC BY, CC ShareAlike, CC NonCommercial, CC0 / Public Domain, CC (other), ODC / ODbL, OpenRAIL, Llama License, Gemma License, Other Permissive, Other. |
| Size Category | An ordinal bucket derived from the dataset's size_categories tag: <1K, 1K–10K, 10K–100K, 100K–1M, 1M–10M, 10M–100M, 100M–1B, 1B–10B, 10B–100B, 100B–1T, >1T, Unknown. |
| Language | A single label per dataset: Multilingual if multiple languages or the multilinguality tag says so; Translation for parallel/translation corpora; otherwise the human-readable name of the single declared language. The top 9 are shown individually; all others are grouped as "Other". |
| Likes (log10) | Base-10 logarithm of the dataset's HuggingFace like count. Log scale spreads out the long tail of low-likes datasets while preserving differences among the most-liked. |
| Downloads (log10) | Base-10 logarithm of the dataset's all-time download count. |
| Subject Domain (LLM) | What the data is ABOUT, classified by Claude Haiku from a fixed taxonomy. Values include general-web-text, instruction-and-chat, code-and-software, math-and-reasoning, scientific-research, medical-and-biomedical, natural-images-and-video, generated-media, speech-and-audio, legal-and-policy, finance-and-business, safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use, entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated. |
| Provenance (LLM) | HOW the data was created. Values: human-created, web-scraped, sensor-recorded, llm-generated, translated, filtered-subset, remix, algorithmically-derived, mixed, other, not_stated. |
| Training Stage (LLM) | The dataset's role in the training stack. A primary stage is picked per dataset using a priority order (preference > sft > eval > domain-finetune > pretraining > raw-corpus); see the sketch below the table. Underlying multi-select values are preserved in the parquet and shown as a comma-joined list in hover. |
| Format Convention (LLM) | The schema shape of each record. Values: sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption, audio-transcript, raw-text, tabular, structured-record, other, not_stated. |
| Role: Benchmark vs Training (LLM) | Derived from the is_benchmark boolean: Benchmark, Training data, or Unknown. |
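The Training Stage colormap's primary-stage pick is a simple priority scan; a minimal sketch (primary_stage is an illustrative helper, not the pipeline's actual code):

```python
PRIORITY = ["preference", "sft", "eval", "domain-finetune",
            "pretraining", "raw-corpus"]

def primary_stage(stages: list[str]) -> str:
    # Return the highest-priority stage present in the multi-select list.
    for stage in PRIORITY:
        if stage in stages:
            return stage
    return "Unknown"
```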
Prior Work #
A close prior example is the ArXiv Data Map, which uses the same embed–reduce–label–render pipeline on 2.4 million ArXiv papers. The pipeline structure here is the same; the differences are the corpus (HF dataset cards instead of ArXiv abstracts), an additional LLM-extraction stage that produces structured fields against a fixed taxonomy, and per-card LLM-written TL;DR summaries.
More broadly, this project sits in a tradition of spatially mapping knowledge domains. Katy Börner's UCSD Map of Science, described in Atlas of Science, mapped 7.2 million scientific publications across 554 subdisciplines using bibliometric citation networks — an early demonstration that large document corpora can be organized usefully in two dimensions.
HuggingFace's own dataset hub UI is the natural baseline for browsing datasets, but it is a list-and-filter interface, not a spatial one. Two datasets that are conceptually neighbors (say, two Japanese-translation instruction sets) might appear pages apart there while sharing the same patch on this map. The map is meant as a complement, not a replacement.
A sister project applies the same pipeline to the top 10,000 most-starred GitHub repositories, with a different LLM-extraction taxonomy (project type, target audience) suited to source repositories rather than datasets.
Acknowledgements #
Built on open-source tools from the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering. HuggingFace provides the underlying dataset metadata and card content through the Hub API.
Limitations #
This is a map of HuggingFace datasets, not the map. As with any map, it is not the territory: it's a projection shaped by specific choices about what to measure, how to measure it, and how to render the result. As George Box put it, all models are wrong, but some are useful. I think this map is useful, but it's worth understanding where it falls short.
- Embedding model dependence. The embedding model is the single biggest determinant of what "similarity" means on this map. Every downstream step (UMAP layout, clustering, topic labels) inherits the geometry it defines. Cohere's embed-v4.0 was trained on general web and document data, so it captures natural-language similarity well but may underweight signals a dataset author would consider important: the actual schema of the records, embedded JSON examples, or the statistical distribution of labels. A different embedding model, especially a multimodal or schema-aware one, could produce a meaningfully different map from the same inputs.
- Selection bias from the likes signal. The map includes only the top 5,000 datasets by HuggingFace likes. Likes correlate with community visibility and English-language documentation, so the corpus systematically over-represents community-curated, NLP-leaning datasets and under-represents domain-specific or freshly-uploaded ones. Ranking by downloads instead would surface a different population (more vision/robotics/pipeline-plumbing data with median 0 likes); the top-1K overlap between the two rankings is only ~17%.
- Card-driven placement. Datasets are positioned by what their card says, not by what the data contains. A dataset with a sparse or misleading card will be misplaced, and cards shorter than 200 characters are excluded entirely. Cards are also truncated to 4,000 characters before embedding, so content deeper in the file has no influence on placement.
- Dimensionality reduction artifacts. UMAP does a good job preserving both local and global structure, but projecting 512 dimensions onto 2 inevitably loses information. Nearby points on the map genuinely have similar cards, and the relative positions of clusters are broadly meaningful, but inter-cluster distances should not be read as precise measurements of dissimilarity.
- LLM-generated content. The TL;DR summaries (Claude Haiku), structured-field extractions (Claude Haiku), and topic labels (Claude Sonnet via Toponymy) are all generated without human review. They can oversimplify, miscategorize, or occasionally hallucinate details that aren't in the source card. The structured fields are constrained to a fixed taxonomy that itself reflects design choices about which axes to surface; some real distinctions (e.g. specific dataset families, particular author lineages) are not captured by any field.
- HuggingFace metadata is uneven. Tags such as size_categories, language, license, and task_categories are author-supplied and often missing, ambiguous, or inconsistently formatted. The colormap aggregations (e.g. license families, language buckets) absorb a lot of this messiness, but a card with no tags will fall into "Unknown" or "Other" buckets even when the data itself is clearly classifiable.
- Point-in-time snapshot. The map reflects a single moment. Likes, downloads, card content, and even dataset existence all change over time. There's no mechanism for incremental updates; regenerating the map requires rerunning the pipeline.
- Proprietary tooling. There is some irony in using proprietary, closed-source tools (Cohere's embedding model and Anthropic's LLMs) to build a map of notable open-data projects. These choices were made for quality and convenience, but they mean the pipeline can't be fully reproduced without access to commercial APIs, and the embedding and labeling behavior is opaque.
About #
- Author: Steven Fazzio
- Source code: stevenfazzio/huggingface-dataset-map
- Sister project: stevenfazzio/semantic-github-map — same pipeline, applied to the top 10,000 most-starred GitHub repositories
- Feedback: Bug reports and feature requests are welcome. Please open an issue.
- License: MIT License