Methodology

How the Semantic Map of GitHub visualization is built

Data as of March 2026

Overview #

This visualization maps the top 10,000 most-starred GitHub repositories onto a 2D plane, positioned by the semantic similarity of their README files. Repositories with similar descriptions and purposes appear near each other, revealing natural clusters of related projects across the open-source ecosystem.

The map is generated by a pipeline that fetches GitHub repository data, embeds README content into high-dimensional vectors, reduces those vectors to two dimensions, applies hierarchical topic labeling, and renders an interactive HTML visualization.

10K GitHub Repos → README Text → LLM Summarize → Embed (512D) → UMAP to 2D → Cluster & Label → Interactive Map

Corpus Collection #

The pipeline uses a two-phase approach to identify the top 10,000 most-starred repositories. First, it queries GH Archive on BigQuery for repositories with significant recent star activity, producing a generous candidate list (~25K repos). Then, each candidate is looked up via the GitHub GraphQL API (batched 25–50 per request), collecting metadata (name, stars, language, license, creation date) and the full README content. Results are sorted by star count and the top 10,000 are kept. This avoids the Search API's 1,000-result cap and non-deterministic ordering.

Repositories whose READMEs are shorter than 200 characters after stripping whitespace are excluded before embedding. These include empty READMEs, placeholder files, and monorepo pointers that lack sufficient content for meaningful embedding and map placement.

Each README is processed by Claude Haiku to extract five fields: a project title, a 2–3 sentence summary, a short tagline, a project type classification, and a target audience label. These feed into hover tooltips, search, colormap categories, and topic labeling. The raw README text (not the summaries) is what gets embedded in step 4 to drive map placement.

Processing Pipeline #

0. Enumerate candidates

Queries GH Archive on BigQuery for repositories with significant recent star activity, producing a generous candidate list (~25K repos). This step is optional; if it is skipped, a pre-built candidate list is used automatically.
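The candidate query might look roughly like the following. The dataset layout follows GH Archive's public BigQuery schema (monthly tables, `type`, `repo.name`); the exact SQL here is illustrative, with the date range, threshold, and limit taken from the parameter table later in this document:

```python
# Sketch of the GH Archive candidate enumeration (step 0). A WatchEvent in
# GH Archive corresponds to a user starring the repository.
CANDIDATE_QUERY = """
SELECT
  repo.name AS repo_name,
  COUNT(*) AS star_events
FROM `githubarchive.month.*`
WHERE _TABLE_SUFFIX BETWEEN '202201' AND '202603'
  AND type = 'WatchEvent'
GROUP BY repo_name
HAVING star_events >= 200   -- min star events
ORDER BY star_events DESC
LIMIT 25000                 -- candidate limit
"""
```

Counting star events over a multi-year window is a proxy for total stars; the authoritative star counts come from the GraphQL lookups in step 1.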

1. Fetch repositories

Looks up each candidate via the GitHub GraphQL API (batched 25–50 per request), fetching metadata and README content, then sorts by stars and keeps the top 10,000.
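One way to batch GraphQL lookups is to alias many `repository` fields into a single query document. This sketch (alias scheme and helper name are mine) uses fields that the GitHub GraphQL schema does provide, including README access via `object(expression: "HEAD:README.md")`:

```python
def build_batch_query(full_names: list[str]) -> str:
    """Build one GraphQL query fetching many repos via field aliases (r0, r1, ...)."""
    parts = []
    for i, full_name in enumerate(full_names):
        owner, name = full_name.split("/", 1)
        parts.append(f'''
  r{i}: repository(owner: "{owner}", name: "{name}") {{
    nameWithOwner
    stargazerCount
    primaryLanguage {{ name }}
    licenseInfo {{ spdxId }}
    createdAt
    readme: object(expression: "HEAD:README.md") {{ ... on Blob {{ text }} }}
  }}''')
    return "query {" + "".join(parts) + "\n}"

query = build_batch_query(["torvalds/linux", "python/cpython"])
```

A real implementation would also handle repos whose README lives at a different path or casing, and repos that have been renamed or deleted since the candidate list was built.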

2. Select top repos

Trims the dataset to the top 10,000 repositories by star count, backing up the original before trimming.

3. Summarize READMEs

Sends each README to Claude Haiku to extract a project title, short summary, tagline, project type, and target audience, used for hover tooltips, search, and colormap categories.

4. Embed READMEs

Encodes README text into 512-dimensional vectors using Cohere's embed-v4.0 model.
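Embedding APIs cap how many texts one request may carry, so the READMEs are chunked before calling the model. The batching helper below is a sketch (the batch size of 4 is purely for illustration); the commented-out API call shows the rough shape of the Cohere request, with exact keyword names depending on the SDK version:

```python
def batched(texts: list[str], size: int) -> list[list[str]]:
    """Split READMEs into request-sized chunks."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

texts = [f"readme {i}" for i in range(10)]
batches = batched(texts, 4)

# Sketch of the per-batch call (client setup and retry handling omitted;
# the 512-dimensional output matches the pipeline's setting):
#
#   resp = co.embed(texts=batch, model="embed-v4.0",
#                   input_type="search_document", output_dimension=512)
```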

5. Reduce to 2D

Applies UMAP (n_neighbors=15, min_dist=0.05, metric="cosine") to project the 512-dimensional embeddings down to 2D coordinates for map placement.
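With those parameters, the reduction step can be sketched as follows. The variable names are mine, and umap-learn is imported inside the function so the parameter dictionary stands on its own:

```python
# UMAP settings as stated in the step description above.
UMAP_PARAMS = {"n_neighbors": 15, "min_dist": 0.05, "metric": "cosine"}

def reduce_to_2d(embeddings):
    """Project (n, 512) embedding vectors down to (n, 2) map coordinates."""
    import umap  # umap-learn; imported lazily here

    reducer = umap.UMAP(n_components=2, **UMAP_PARAMS)
    return reducer.fit_transform(embeddings)
```

The cosine metric matters here: text-embedding similarity is conventionally measured by angle rather than Euclidean distance, so the 2D layout should preserve cosine neighborhoods.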

6. Label topics

Uses the Toponymy library for hierarchical density-based clustering (min_clusters=4, lowest_detail_level=0.5, highest_detail_level=1.0), then sends representative documents from each cluster to Claude Sonnet to generate human-readable topic labels at multiple levels of detail.

7. Visualize

Combines coordinates, labels, and metadata into an interactive HTML map using DataMapPlot, with multiple colormaps, search, hover tooltips, and click-to-open functionality.
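As an illustration of how the metadata feeds the tooltips, the hover-text assembly might look like the sketch below (field names are assumptions). The DataMapPlot call itself is left as a comment, since its exact keyword arguments vary by library version:

```python
def make_hover_text(repo: dict) -> str:
    """Assemble a tooltip string from repo metadata and summarization fields."""
    stars = f'{repo["stars"]:,}'
    return f'{repo["name"]} ({stars} stars)\n{repo["tagline"]}\n{repo["summary"]}'

repo = {
    "name": "torvalds/linux",
    "stars": 180000,
    "tagline": "The Linux kernel source tree",
    "summary": "Mainline source repository for the Linux kernel.",
}
hover = make_hover_text(repo)

# Rendering would pass the 2D coordinates, the Toponymy label layers, and the
# per-point hover text to datamapplot's interactive-plot builder; the precise
# arguments are elided here.
```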

Tools & Technologies #

| Tool | Role |
| --- | --- |
| Cohere embed-v4.0 | README text embedding (512 dimensions) |
| UMAP | Dimensionality reduction from 512D to 2D |
| Toponymy | Hierarchical density-based topic labeling |
| DataMapPlot | Interactive HTML map rendering |
| Claude Haiku 4.5 | README summarization |
| Claude Sonnet 4 | Topic label generation |
| GitHub GraphQL API | Repository metadata and README fetching |
| BigQuery (GH Archive) | Candidate repository enumeration |

Notable Parameters #

Key parameter values used across the pipeline. These are the authoritative reference; some also appear inline in the step descriptions above.

| Parameter | Value | Notes |
| --- | --- | --- |
| **Corpus (steps 0–1)** | | |
| BigQuery date range | 2022-01 to 2026-03 | GH Archive WatchEvent tables scanned |
| Min star events | 200 | Filters out repos with low recent activity |
| Candidate limit | 25,000 | BigQuery query LIMIT |
| Min README length | 200 chars | After stripping whitespace; shorter READMEs excluded |
| **Summarization (step 3)** | | |
| Model | Claude Haiku 4.5 | claude-haiku-4-5 |
| README truncation | 4,000 chars | Longer READMEs are truncated before sending to the model |
| **Embedding (step 4)** | | |
| Model | Cohere embed-v4.0 | embed-v4.0 |
| Dimensions | 512 | Output vector size |
| **UMAP dimensionality reduction (step 5)** | | |
| n_neighbors | 15 | UMAP local neighborhood size |
| min_dist | 0.05 | Controls tightness of clusters in 2D |
| Metric | cosine | Distance metric |
| **Topic labeling (step 6)** | | |
| Model | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | lowest_detail_level to highest_detail_level |
| Object description | "GitHub repository descriptions" | Passed to Toponymy LLM wrapper |
| Corpus description | "collection of the top 10,000 most-starred GitHub repositories" | Passed to Toponymy LLM wrapper |

LLM Prompts #

Summarization prompt (step 3)

Each README (truncated to 4,000 characters) is sent to Claude Haiku 4.5 with the following system prompt. The model returns a JSON object with five fields used for tooltips, search, and colormaps.

You are given the README of a GitHub repository. Return a JSON object with five fields:
- "title": The project's display name as presented in the README. If the README does not mention a project name, return null.
- "summary": A 2-3 sentence summary explaining what the project does, its key features, and what makes it notable. Focus on specifics that differentiate it from similar projects.
- "project_type": One of: "Library", "Framework", "CLI Tool", "Application", "Dataset", "Tutorial/Educational", "Collection/Awesome List", "Plugin/Extension", "API/Service", "Research", "Other".
- "tagline": A short noun-phrase label (3-7 words) identifying what the project *is* — a category/identity, not a feature list. Write for a technical audience. Bad: 'Modernized git CLI with suggestions and simplified workflows' Good: 'Modern git CLI wrapper'
- "target_audience": The primary audience for this project. One of: "Developers", "Data & ML Engineers", "DevOps & Infrastructure", "System Programmers", "Security Professionals", "End Users", "Learners & Educators", "Researchers". Choose the single best fit:
  - "Developers": General software developers (web, mobile, desktop)
  - "Data & ML Engineers": Data scientists, ML/AI practitioners
  - "DevOps & Infrastructure": Cloud, containers, CI/CD, monitoring
  - "System Programmers": OS, embedded, compilers, low-level systems
  - "Security Professionals": Pentesting, crypto, vulnerability research
  - "End Users": Non-developers who use the software directly
  - "Learners & Educators": Students, tutorial followers, course creators
  - "Researchers": Academic or scientific researchers

Respond with only the JSON object, no markdown fencing.
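Although the prompt asks for bare JSON, models occasionally wrap their reply in markdown fences anyway, so the response handling benefits from being defensive. A sketch of such a parser (function and constant names are mine):

```python
import json

EXPECTED_FIELDS = {"title", "summary", "project_type", "tagline", "target_audience"}

def parse_summary(raw: str) -> dict:
    """Parse the model's reply, tolerating stray markdown code fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (and optional "json" tag) and the closer.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = json.loads(text)
    missing = EXPECTED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

reply = (
    '```json\n'
    '{"title": "jq", "summary": "A JSON processor.", '
    '"project_type": "CLI Tool", "tagline": "Command-line JSON processor", '
    '"target_audience": "Developers"}\n'
    '```'
)
record = parse_summary(reply)
```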

Topic labeling (step 6)

Topic labels are generated by the Toponymy library, which constructs its own LLM prompts internally. The pipeline configures Toponymy with values that shape those prompts: the object_description is set to "GitHub repository descriptions" and the corpus_description to "collection of the top 10,000 most-starred GitHub repositories". Exemplar documents are delimited with triple-quoted Python docstring markers (    * """…"""). Toponymy generates labels at detail levels between 0.5 and 1.0, with the number of hierarchical layers determined automatically by the clustering algorithm (min_clusters=4). Claude Sonnet 4 serves as the LLM backend for generating human-readable topic names.

Using the Visualization #

Field Definitions #

| Colormap | Description |
| --- | --- |
| Primary Language | The repository's primary programming language as reported by GitHub. The top 9 languages are shown individually; all others are grouped as "Other". |
| License Family | Licenses grouped into families: MIT, Apache, GPL, BSD, Creative Commons, MPL, Other Permissive, and Unknown/None. |
| Project Type | The type of project as classified by Claude Haiku: Library, Framework, CLI Tool, Application, Dataset, Tutorial/Educational, Collection/Awesome List, Plugin/Extension, API/Service, Research, or Other. |
| Target Audience | The intended audience as classified by Claude Haiku: Developers, Data & ML Engineers, DevOps & Infrastructure, System Programmers, Security Professionals, End Users, Learners & Educators, or Researchers. |
| Activity Status | Three categories based on archive status and recent activity: Active (not archived, pushed within the last 2 years), Inactive (not archived, no push in 2+ years), Archived (explicitly archived by the owner). |
| Owner Type | Whether the repository is owned by an individual User or an Organization account. |
| Created Date | The date the repository was created on GitHub. Older repos appear in darker tones; newer repos appear brighter. |
| Last Push (days, log10) | Base-10 logarithm of days since the most recent push. Green indicates recent activity; red indicates staleness. |
| Star Count (log10) | Base-10 logarithm of the repository's star count. Log scale helps distinguish differences among highly-starred repos. |
| Fork Count (log10) | Base-10 logarithm of the repository's fork count. Log scale helps distinguish differences among heavily-forked repos. |
| Open Issues (log10) | Base-10 logarithm of the repository's open issue count. Log scale normalizes the wide range of issue counts. |
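The Activity Status categories derive from just two metadata fields. A sketch of the classification (field names and the fixed "now" are mine; the 2-year threshold is from the definition above):

```python
from datetime import datetime, timedelta

TWO_YEARS = timedelta(days=730)  # the colormap's 2-year activity threshold

def activity_status(is_archived: bool, pushed_at: datetime, now: datetime) -> str:
    """Classify a repo as Archived, Active, or Inactive per the map's rules."""
    if is_archived:
        return "Archived"  # explicit archive status wins over recency
    return "Active" if now - pushed_at <= TWO_YEARS else "Inactive"

now = datetime(2026, 3, 1)
status_recent = activity_status(False, datetime(2025, 12, 1), now)
status_stale = activity_status(False, datetime(2023, 1, 15), now)
status_archived = activity_status(True, datetime(2025, 12, 1), now)
```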

Prior Work #

The most notable prior work in this space is the Map of GitHub (source) by Andrei Kashcha (anvaka), an impressive visualization that maps roughly 690,000 GitHub repositories onto a zoomable, globe-style interface. It remains one of the most ambitious public efforts to spatially organize the open-source ecosystem.

The core methodological difference is what signal drives map placement. Anvaka's map uses a collaborative-filtering approach: it computes Jaccard similarity over shared stargazers, so repositories that attract the same people end up near each other. This project instead uses a content-based signal: README text is embedded into high-dimensional vectors via Cohere, and repositories with semantically similar descriptions land near each other after UMAP reduction. The two approaches produce meaningfully different groupings: Anvaka's map clusters repos that appeal to the same audience, while this map clusters repos that describe similar functionality. They also differ in scale, with Anvaka's covering ~690K repos compared to 10K here.

There are several other notable differences. For layout, Anvaka uses Leiden community detection followed by a force-directed graph layout, whereas this project applies UMAP dimensionality reduction directly to the embedding space. For rendering, Anvaka's map uses MapLibre GL with vector tiles for smooth globe-style navigation, while this project uses DataMapPlot to produce a self-contained interactive HTML file. For topic labeling, Anvaka assigns names to roughly 1,500 cluster "countries" using ChatGPT, while this project uses the Toponymy library to generate hierarchical multi-level labels via Claude Sonnet, producing labels at varying levels of detail rather than a single flat layer. Both are useful views into the same ecosystem.

At a broader level, this project belongs to a tradition of spatially mapping knowledge domains. Katy Börner's UCSD Map of Science, developed with her collaborators and described in her book Atlas of Science, mapped 7.2 million scientific publications across 554 subdisciplines using bibliometric citation networks. That work established many of the ideas this project draws on: that large corpora of documents can be meaningfully organized in two-dimensional space, and that the resulting maps can reveal structure that is difficult to see any other way.

More recently, Leland McInnes' ArXiv Data Map visualized 2.4 million ArXiv papers using an embed–reduce–label–render pipeline that is nearly identical to the one used here. That project served as a direct demonstration that the methodology (and the specific tools) could produce rich, navigable maps of large document collections. This project is essentially an application of the same approach to a different corpus.

Acknowledgements #

This project's pipeline is built on open-source tools created by Leland McInnes and collaborators at the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering. McInnes' ArXiv Data Map demonstrated that these tools, combined with modern text embeddings, could produce navigable, labeled maps of millions of documents. This project applies that same approach to the GitHub ecosystem.

Limitations #

This is a map of GitHub, not the map of GitHub. As with any map, the map is not the territory. It's a projection shaped by specific choices about what to measure, how to measure it, and how to render the result. To paraphrase George Box: all models are wrong, but some are useful. I think this map is useful, but it's worth understanding where it falls short.

About #