Methodology

How the Semantic Map of HuggingFace Datasets is built

Data as of April 2026

Overview #

This visualization maps the top 5,000 most-liked HuggingFace datasets onto a 2D plane, positioned by the semantic similarity of their dataset cards (the README associated with each repo). Datasets that describe similar content and purpose appear near each other, revealing natural clusters across the open-data ecosystem.

The map is generated by a pipeline that fetches dataset metadata and cards from the HuggingFace Hub, embeds the card text into high-dimensional vectors, reduces those vectors to two dimensions, applies hierarchical topic labeling, augments each card with LLM-extracted structured fields and short summaries, and renders an interactive HTML visualization.

[Pipeline diagram] 5K HF Datasets → Card Text → { LLM Summarize · LLM Extract · Embed (512D) → UMAP → 2D → Cluster & Label } → Interactive Map

Corpus Collection #

The pipeline uses a single-stage enumeration: a list_datasets call to the HuggingFace Hub API requests a generous overshoot of 6,000 dataset entries, sorted by the likes field. Each entry's metadata (id, author, likes, downloads, language, license, size category, modality, task category, last_modified, created_at) is captured in the same response. README cards are then downloaded individually via hf_hub_download, with retry-with-backoff for 429/5xx responses.

Cards shorter than 200 characters after stripping YAML frontmatter and whitespace are excluded. The remaining cards are sorted by likes and the top 5,000 are kept. The enumeration step is deliberately not BigQuery-style: HF Hub's list_datasets already returns a ranked, filterable result set, so a separate candidate pre-pass is unnecessary. The likes ranking surfaces a community-curated, mostly NLP-leaning slice of the Hub; ranking by downloads instead pulls in vision/robotics/pipeline-plumbing data with median 0 likes (the top-1K overlap between the two rankings is only ~17%).
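The quoted overlap figure is easy to reproduce with two enumeration calls; a minimal sketch, assuming the same huggingface_hub interface used in step 0 below:

```python
from huggingface_hub import HfApi

api = HfApi()
top_by_likes = {d.id for d in api.list_datasets(sort="likes", limit=1000)}
top_by_downloads = {d.id for d in api.list_datasets(sort="downloads", limit=1000)}
print(len(top_by_likes & top_by_downloads) / 1000)  # ~0.17 as of the April 2026 snapshot
```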

Each card is processed by Claude Haiku 4.5 in two independent passes. The first extracts eight structured fields against a fixed taxonomy (subject domain, training stage, format, provenance, etc.). The second writes a single-sentence (≤25-word) TL;DR summary. Both feed into hover tooltips, search, and colormap categories. The raw card text — not the summaries — is what gets embedded to drive map placement.

Processing Pipeline #

0. Fetch datasets

Calls HfApi.list_datasets(sort="likes", limit=6000, full=True), then downloads each card via hf_hub_download in parallel (8 concurrent workers, retry-with-exponential-backoff on 429/5xx). Strips YAML frontmatter, drops cards under 200 characters, sorts by likes, keeps the top 5,000. Resumable: rerunning re-uses any cards already saved in data/datasets.parquet unless --refresh is passed.
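A condensed sketch of this step under the parameters above (error handling simplified; the real pipeline also writes data/datasets.parquet for resumability):

```python
import re, time
from concurrent.futures import ThreadPoolExecutor
from huggingface_hub import HfApi, hf_hub_download
from huggingface_hub.utils import HfHubHTTPError

FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)  # leading YAML block

def fetch_card(repo_id, max_tries=5):
    """Download one dataset README, backing off exponentially on 429/5xx."""
    for attempt in range(max_tries):
        try:
            path = hf_hub_download(repo_id, "README.md", repo_type="dataset")
            with open(path, encoding="utf-8") as f:
                return f.read()
        except HfHubHTTPError as e:
            status = getattr(e.response, "status_code", None)
            if status == 429 or (status or 0) >= 500:
                time.sleep(2 ** attempt)   # 1s, 2s, 4s, ...
                continue
            return None                    # e.g. 404: repo has no README
    return None

entries = list(HfApi().list_datasets(sort="likes", limit=6000, full=True))
with ThreadPoolExecutor(max_workers=8) as pool:
    raw = pool.map(fetch_card, [d.id for d in entries])

cards = []
for info, text in zip(entries, raw):
    body = FRONTMATTER.sub("", text or "").strip()
    if len(body) >= 200:                   # drop stub cards
        cards.append({"repo_id": info.id, "likes": info.likes, "card": body})
cards = sorted(cards, key=lambda c: c["likes"], reverse=True)[:5000]
```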

1. Embed cards

Encodes each card (truncated to 4,000 characters) into a 512-dimensional vector using Cohere's embed-v4.0 model with input_type="clustering", in batches of 96.
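A sketch of the embedding loop with Cohere's v2 Python client (client construction and response-field access follow the SDK's documented pattern; verify against your installed version):

```python
import cohere

co = cohere.ClientV2()  # reads CO_API_KEY from the environment

def embed_cards(texts, batch_size=96):
    """Embed card text (already truncated to 4,000 chars) into 512-D vectors."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = co.embed(
            model="embed-v4.0",
            input_type="clustering",   # hint that vectors feed a clustering task
            texts=texts[i : i + batch_size],
            output_dimension=512,      # embed-v4.0 supports Matryoshka output sizes
            embedding_types=["float"],
        )
        vectors.extend(resp.embeddings.float_)
    return vectors
```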

2. Reduce to 2D

Applies UMAP (n_neighbors=15, min_dist=0.05, metric="cosine", random_state=42) to project the 512-dimensional embeddings down to 2D coordinates for map placement.
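The projection is a direct application of umap-learn with exactly the parameters above:

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,   # local neighborhood size: balances local vs. global structure
    min_dist=0.05,    # small value lets clusters pack tightly in 2D
    metric="cosine",  # angular distance suits text embeddings
    random_state=42,  # fixed seed so the layout is reproducible
)
coords = reducer.fit_transform(embeddings)  # (5000, 512) -> (5000, 2)
```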

3. Label topics

Uses the Toponymy library for hierarchical density-based clustering (min_clusters=4, lowest_detail_level=0.5, highest_detail_level=1.0), then sends representative documents from each cluster to Claude Sonnet 4 to generate human-readable topic labels at multiple levels of detail. Documents passed to the labeler are composites of pretty_name, repo_id, selected metadata tags, and a card excerpt (up to 2,000 characters).

4. Extract structured fields

Sends each card to Claude Haiku 4.5 with a system prompt assembled from pipeline/taxonomy.json. Eight fields are extracted per card: provenance_method, subject_domain, training_stage, format_convention, special_characteristics, geo_scope, upstream_models, and is_benchmark. Each value is paired with a short verbatim quote from the card. Resumable via per-repo JSON files in data/structured_fields_cache/.
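One extraction call with the Anthropic Python SDK might look like this; the cache_control tag on the system block is the prompt-caching mechanism named in the parameters table, while max_tokens and the user-message layout are assumptions:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_fields(repo_id: str, card: str, system_prompt: str) -> dict:
    """Extract the eight structured fields from one card."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,  # assumed budget; the JSON output is small
        system=[{
            "type": "text",
            "text": system_prompt,                   # assembled from taxonomy.json
            "cache_control": {"type": "ephemeral"},  # cache the long taxonomy block
        }],
        messages=[{"role": "user", "content": f"Dataset: {repo_id}\n\n{card[:6000]}"}],
    )
    return json.loads(resp.content[0].text)
```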

4b. Summarize cards

Sends each card to Claude Haiku 4.5 to produce a single-sentence (≤25-word) TL;DR summary. Independent of step 4 (different prompt, different output, different cache directory). Resumable via per-repo JSON files in data/summaries_cache/.
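Both LLM passes share the same resumability pattern: one JSON file per repo, skipped on rerun, with an asyncio semaphore capping concurrency at 12. A sketch (the filename scheme and the summarize_card wrapper are hypothetical):

```python
import asyncio
import json
from pathlib import Path

CACHE_DIR = Path("data/summaries_cache")   # per-repo JSON files, as described above
semaphore = asyncio.Semaphore(12)          # concurrency from the parameters table

async def summarize_cached(repo_id: str, card: str) -> dict:
    """Skip repos already summarized; otherwise call the model and cache the result."""
    out_path = CACHE_DIR / (repo_id.replace("/", "__") + ".json")
    if out_path.exists():                  # resumability: reruns pick up where they left off
        return json.loads(out_path.read_text())
    async with semaphore:
        result = await summarize_card(card[:4000])  # hypothetical async LLM wrapper
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(result))
    return result
```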

5. Visualize

Combines coordinates, topic labels, structured fields, summaries, and HF metadata into an interactive HTML map using DataMapPlot, with multiple colormaps, search, hover tooltips, click-to-open functionality, and an injected advanced-filters panel.
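At its core the rendering step is a single DataMapPlot call; a minimal sketch (variable names and the click handler are illustrative, and the real pipeline passes more styling and metadata):

```python
import datamapplot

plot = datamapplot.create_interactive_plot(
    coords,                # 2D UMAP coordinates from step 2
    *topic_label_layers,   # one array of labels per Toponymy detail level
    hover_text=repo_ids,   # enriched further with summaries in the real pipeline
    enable_search=True,
    on_click="window.open(`https://huggingface.co/datasets/{hover_text}`)",
)
plot.save("index.html")    # the advanced-filters panel is injected into this HTML
```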

Tools & Technologies #

| Tool | Role |
| --- | --- |
| HuggingFace Hub API | Dataset enumeration and card download |
| Cohere embed-v4.0 | Card text embedding (512 dimensions) |
| UMAP | Dimensionality reduction from 512D to 2D |
| Toponymy | Hierarchical density-based topic labeling |
| DataMapPlot | Interactive HTML map rendering |
| Claude Sonnet 4 | Topic label generation (inside Toponymy) |
| Claude Haiku 4.5 | Structured-field extraction and TL;DR summarization |

Notable Parameters #

Key parameter values used across the pipeline. These are the authoritative reference; some also appear inline in the step descriptions above.

| Parameter | Value | Notes |
| --- | --- | --- |
| Corpus (step 0) | | |
| Rank signal | likes | Single field passed to HfApi.list_datasets(sort=...) |
| Target count | 5,000 | Final corpus size after filtering |
| Fetch overshoot | 6,000 | Ask for more so short-card filtering doesn't shrink below target |
| Min card length | 200 chars | After YAML stripping; shorter cards excluded |
| Card truncation | 4,000 chars | For embedding and storage |
| Embedding (step 1) | | |
| Model | Cohere embed-v4.0 | input_type="clustering" |
| Dimensions | 512 | Output vector size |
| Batch size | 96 | Cards per API call |
| UMAP dimensionality reduction (step 2) | | |
| n_neighbors | 15 | Local neighborhood size |
| min_dist | 0.05 | Controls tightness of clusters in 2D |
| Metric | cosine | Distance metric |
| random_state | 42 | For reproducibility |
| Topic labeling (step 3) | | |
| Model | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | lowest_detail_level to highest_detail_level |
| Object description | "HuggingFace dataset cards" | Passed to Toponymy LLM wrapper |
| Corpus description | "collection of the top 5,000 HuggingFace datasets ranked by likes" | Passed to Toponymy LLM wrapper |
| Structured-field extraction (step 4) | | |
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 6,000 chars | Per call, before sending to the model |
| Concurrency | 12 | Async semaphore |
| Cache control | ephemeral | System prompt is cache-tagged for prompt caching |
| Card summarization (step 4b) | | |
| Model | Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Card truncation | 4,000 chars | Shorter than the extraction pass's 6,000, since TL;DRs draw on the card's opening |
| Max words | 25 | Hard budget enforced by prompt |
| Concurrency | 12 | Async semaphore |

LLM Prompts #

Structured-field extraction (step 4)

The system prompt is assembled at runtime from pipeline/taxonomy.json, which defines eight fields and their allowed slugs. For each field the prompt enumerates the slugs and their descriptions; the user message contains the dataset id and the truncated card body. The model returns a JSON object with one entry per field, where each entry has both a value (a slug, list of slugs, or boolean) and a short verbatim quote from the card that justified the choice. The scaffold below shows the system-prompt structure; field definitions are sourced from pipeline/taxonomy.json in the repo.

You extract constrained structured metadata from HuggingFace dataset cards.

RULES:
- For slug-valued fields, return one of the provided slugs verbatim. No paraphrases, no combined values like 'a / b'.
- For LIST-typed fields, the `value` MUST be a JSON array even if only one item applies: `["item"]`. Never a bare string.
- Each field captures a DIFFERENT axis. Do NOT reuse a slug from one field as the value for another field. Axis definitions: `subject_domain` = what the data is ABOUT (noun); `provenance_method` = HOW the data was created; `training_stage` = what the data is FOR in the training stack; `format_convention` = the SCHEMA SHAPE of each record; `special_characteristics` = orthogonal PROPERTIES (long-context, roleplay, multilingual-parallel, reasoning-traces, etc.). […]
- For each field, include `quote`: a ≤25-word verbatim span from the card that justified your choice. If silent, use the sentinel 'not_stated' for the quote.
- Output strictly valid JSON. No prose, no markdown fences, no commentary outside the JSON object.

FIELD DEFINITIONS:

FIELD: provenance_method (pick EXACTLY ONE slug)
  - human-created: Primary-source content written, recorded, or annotated by humans …
  - web-scraped: Harvested from public web sources …
  - llm-generated: Content produced by an LLM or generative model …
  [… 8 more slugs …]

FIELD: subject_domain (pick EXACTLY ONE slug)
  - general-web-text, instruction-and-chat, code-and-software, math-and-reasoning,
    scientific-research, medical-and-biomedical, natural-images-and-video,
    generated-media, speech-and-audio, legal-and-policy, finance-and-business,
    safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use,
    entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated

FIELD: training_stage (return a JSON LIST of applicable slugs …)
  - pretraining, sft, preference, eval, domain-finetune, raw-corpus,
    not_applicable, other, not_stated

FIELD: format_convention (pick EXACTLY ONE slug)
  - sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption,
    audio-transcript, raw-text, tabular, structured-record, other, not_stated

FIELD: special_characteristics (return a JSON LIST …)
  - long-context, multi-turn, adversarial, reasoning-traces, low-resource-language,
    multilingual-parallel, roleplay, long-form-generation

FIELD: geo_scope (return a JSON LIST)
  RULE: country/region names if explicit; ['global'] only if explicit;
        ['not_applicable'] for content with no meaningful geography;
        ['not_stated'] is the default if silent.

FIELD: upstream_models (return a JSON LIST of raw strings)
  RULE: Models explicitly mentioned as having generated this data
        (e.g. 'GPT-4', 'Claude 3.5 Sonnet'). ['not_applicable'] / ['not_stated'] otherwise.

FIELD: is_benchmark (return true or false)
  RULE: true only if the dataset is explicitly intended as an evaluation benchmark.

OUTPUT SHAPE (fill in values; do not change the structure):
{
  "provenance_method": { "value": <per field rule>, "quote": "<≤25-word span or sentinel>" },
  "subject_domain":    { "value": … },
  …
}
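Assembling that system prompt from pipeline/taxonomy.json might look like the sketch below; the JSON shape (a list of fields, each with a rule and slug/description pairs) is an assumption, not the repo's actual schema:

```python
import json
from pathlib import Path

# Everything before "FIELD DEFINITIONS:" in the scaffold above, verbatim (elided here).
RULES_HEADER = (
    "You extract constrained structured metadata from HuggingFace dataset cards.\n\n"
    "RULES:\n"
    "- For slug-valued fields, return one of the provided slugs verbatim. ...\n"
)

def build_system_prompt(taxonomy_path: str = "pipeline/taxonomy.json") -> str:
    """Render one FIELD block per taxonomy entry; the JSON shape is assumed."""
    taxonomy = json.loads(Path(taxonomy_path).read_text())
    blocks = []
    for field in taxonomy["fields"]:  # assumed: list of {name, rule, slugs}
        lines = [f"FIELD: {field['name']} ({field['rule']})"]
        for slug in field["slugs"]:   # assumed: list of {slug, description}
            lines.append(f"  - {slug['slug']}: {slug['description']}")
        blocks.append("\n".join(lines))
    return RULES_HEADER + "\nFIELD DEFINITIONS:\n\n" + "\n\n".join(blocks)
```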

Card summarization (step 4b)

Each card (truncated to 4,000 characters) is sent to Claude Haiku 4.5 with the system prompt below. The model returns a JSON object with a single summary field containing one sentence of at most 25 words.

You write short, specific TL;DR summaries of HuggingFace datasets.

Your summary MUST:
- Be a single sentence of ≤25 words.
- Describe what the dataset IS directly, using a specific noun phrase as the opening.
- Mention what makes it distinctive: origin, scale, methodology, unique property, or source.
- Be self-contained — a reader should understand the dataset without any other context.

Your summary MUST NOT:
- Start with "This dataset…", "A dataset of…", "This is…", or similar filler openings.
- Exceed 25 words.
- Be generic (e.g. "A text classification dataset" is useless — say what it classifies and why it's interesting).
- Include marketing prose or hedging ("powerful", "comprehensive", "may be useful for…").

Good example summaries:
- "12 million YouTube Music track links auto-discovered by recursively walking 'fans might also like' suggestions from a seed of 45,000 artists."
- "Japanese translation of LLaVA-Instruct-150K via DeepL, for Japanese vision-language instruction tuning."
- "One million anonymized real-estate listings from Divar, Iran's largest classifieds platform, with 57 columns of price and location detail."
- "31 English short-answer questions on communication networks, with reference answers and scored student responses for feedback-generation training."

Output strictly valid JSON: {"summary": "<your sentence>"}. No prose, no markdown fences, nothing outside the JSON.

Topic labeling (step 3)

Topic labels are generated by the Toponymy library, which constructs its own LLM prompts internally. The pipeline configures Toponymy with values that shape those prompts: the object_description is set to "HuggingFace dataset cards" and the corpus_description to "collection of the top 5,000 HuggingFace datasets ranked by likes". Exemplar documents are delimited with triple-quoted Python docstring markers (`* """…"""`). Toponymy generates labels at detail levels between 0.5 and 1.0, with the number of hierarchical layers determined automatically by the clustering algorithm (min_clusters=4). Claude Sonnet 4 serves as the LLM backend for generating human-readable topic names.
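A sketch of that configuration against Toponymy's documented interface; the wrapper signature, the constructor placement of the detail-level arguments, and the result attribute are assumptions worth checking against the installed version:

```python
from toponymy import Toponymy, ToponymyClusterer
from toponymy.llm_wrappers import Anthropic

clusterer = ToponymyClusterer(min_clusters=4)
topic_model = Toponymy(
    llm_wrapper=Anthropic(model="claude-sonnet-4-20250514"),  # signature assumed
    text_embedding_model=embedder,   # embedding model reused for topic vectors
    clusterer=clusterer,
    object_description="HuggingFace dataset cards",
    corpus_description="collection of the top 5,000 HuggingFace datasets ranked by likes",
    lowest_detail_level=0.5,         # placement of these two kwargs is an assumption
    highest_detail_level=1.0,
)
# composite_docs: pretty_name + repo_id + tags + card excerpt, per step 3 above
topic_model.fit(composite_docs, embeddings, coords)
topic_layers = topic_model.topic_names_  # attribute name per the documented pattern
```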

Using the Visualization #

Field Definitions #

| Colormap | Description |
| --- | --- |
| Task Category | The first task_categories tag the dataset card declares (e.g. text-classification, question-answering). The top 9 are shown individually; all others are grouped as "Other". |
| Modality | The first modalities tag the dataset card declares (text, image, audio, video, etc.). The top 9 are shown individually; all others are grouped as "Other". |
| License | Licenses grouped into families: MIT, Apache, GPL Family, BSD, MPL, CC BY, CC ShareAlike, CC NonCommercial, CC0 / Public Domain, CC (other), ODC / ODbL, OpenRAIL, Llama License, Gemma License, Other Permissive, Other. |
| Size Category | An ordinal bucket derived from the dataset's size_categories tag: <1K, 1K–10K, 10K–100K, 100K–1M, 1M–10M, 10M–100M, 100M–1B, 1B–10B, 10B–100B, 100B–1T, >1T, Unknown. |
| Language | A single label per dataset: Multilingual if multiple languages are declared or the multilinguality tag says so; Translation for parallel/translation corpora; otherwise the human-readable name of the single declared language. The top 9 are shown individually; all others are grouped as "Other". |
| Likes (log10) | Base-10 logarithm of the dataset's HuggingFace like count. The log scale spreads out the long tail of low-likes datasets while preserving differences among the most-liked. |
| Downloads (log10) | Base-10 logarithm of the dataset's all-time download count. |
| Subject Domain (LLM) | What the data is ABOUT, classified by Claude Haiku 4.5 from a fixed taxonomy. Values include general-web-text, instruction-and-chat, code-and-software, math-and-reasoning, scientific-research, medical-and-biomedical, natural-images-and-video, generated-media, speech-and-audio, legal-and-policy, finance-and-business, safety-alignment, robotics-and-embodied, 3d-and-simulation, agent-and-tool-use, entertainment-and-gaming, geospatial, multi-domain, rag-evaluation, other, not_stated. |
| Provenance (LLM) | HOW the data was created. Values: human-created, web-scraped, sensor-recorded, llm-generated, translated, filtered-subset, remix, algorithmically-derived, mixed, other, not_stated. |
| Training Stage (LLM) | The dataset's role in the training stack. A primary stage is picked per dataset using a priority order (preference > sft > eval > domain-finetune > pretraining > raw-corpus); see the sketch after this table. Underlying multi-select values are preserved in the parquet and shown as a comma-joined list in hover. |
| Format Convention (LLM) | The schema shape of each record. Values: sharegpt, alpaca, preference-pairs, prompt-completion, vqa, image-caption, audio-transcript, raw-text, tabular, structured-record, other, not_stated. |
| Role: Benchmark vs Training (LLM) | Derived from the is_benchmark boolean: Benchmark, Training data, or Unknown. |
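The primary-stage pick in the Training Stage colormap is a straightforward priority scan; a minimal sketch (the helper name is hypothetical, the priority order is the one given in the row above):

```python
# Highest-priority stage wins when a card carries several training_stage values.
STAGE_PRIORITY = ["preference", "sft", "eval", "domain-finetune", "pretraining", "raw-corpus"]

def primary_stage(stages: list[str]) -> str:
    """Collapse a multi-select training_stage value to one colormap category."""
    for stage in STAGE_PRIORITY:
        if stage in stages:
            return stage
    return "Unknown"

print(primary_stage(["eval", "sft"]))  # -> "sft" (sft outranks eval)
```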

Prior Work #

A close prior example is the ArXiv Data Map, which uses the same embed–reduce–label–render pipeline on 2.4 million ArXiv papers. The pipeline structure here is the same; the differences are the corpus (HF dataset cards instead of ArXiv abstracts), an additional LLM-extraction stage that produces structured fields against a fixed taxonomy, and per-card LLM-written TL;DR summaries.

More broadly, this project sits in a tradition of spatially mapping knowledge domains. Katy Börner's UCSD Map of Science, described in Atlas of Science, mapped 7.2 million scientific publications across 554 subdisciplines using bibliometric citation networks — an early demonstration that large document corpora can be organized usefully in two dimensions.

HuggingFace's own dataset hub UI is the natural baseline for browsing datasets, but it is a list-and-filter interface, not a spatial one. Two datasets that are conceptually neighbors (say, two Japanese-translation instruction sets) might appear pages apart there while sharing the same patch on this map. The map is meant as a complement, not a replacement.

A sister project applies the same pipeline to the top 10,000 most-starred GitHub repositories, with a different LLM-extraction taxonomy (project type, target audience) suited to source repositories rather than datasets.

Acknowledgements #

Built on open-source tools from the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering. HuggingFace provides the underlying dataset metadata and card content through the Hub API.

Limitations #

This is a map of HuggingFace datasets, not the map. As with any map, the map is not the territory. It's a projection shaped by specific choices about what to measure, how to measure it, and how to render the result. To paraphrase George Box: all models are wrong, but some are useful. I think this map is useful, but it's worth understanding where it falls short.

About #