Methodology
How the Semantic Map of GitHub visualization is built
Data as of March 2026
Overview #
This visualization maps the top 10,000 most-starred GitHub repositories onto a 2D plane, positioned by the semantic similarity of their README files. Repositories with similar descriptions and purposes appear near each other, revealing natural clusters of related projects across the open-source ecosystem.
The map is generated by a pipeline that fetches GitHub repository data, embeds README content into high-dimensional vectors, reduces those vectors to two dimensions, applies hierarchical topic labeling, and renders an interactive HTML visualization.
Corpus Collection #
The pipeline uses a two-phase approach to identify the top 10,000 most-starred repositories. First, it queries GH Archive on BigQuery for repositories with significant recent star activity, producing a generous candidate list (~25K repos). Then, each candidate is looked up via the GitHub GraphQL API (batched 25–50 per request), collecting metadata (name, stars, language, license, creation date) and the full README content. Results are sorted by star count and the top 10,000 are kept. This avoids the Search API's 1,000-result cap and non-deterministic ordering.
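Batching with the GraphQL API works by aliasing multiple `repository` fields in a single query (repo0, repo1, …). A minimal sketch of building such a query — the field selection here is simplified; the actual pipeline requests more metadata:

```python
def build_batch_query(repos):
    """Build one GraphQL query fetching several repos via aliases.

    repos: list of (owner, name) tuples.
    """
    parts = []
    for i, (owner, name) in enumerate(repos):
        parts.append(
            f'repo{i}: repository(owner: "{owner}", name: "{name}") {{\n'
            f"    nameWithOwner\n"
            f"    stargazerCount\n"
            f'    object(expression: "HEAD:README.md") {{ ... on Blob {{ text }} }}\n'
            f"  }}"
        )
    return "query {\n  " + "\n  ".join(parts) + "\n}"

query = build_batch_query([("torvalds", "linux"), ("python", "cpython")])
```

A single round trip then returns README text and metadata for every aliased repository, which is what makes 25–50 repos per request practical.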
Repositories whose READMEs are shorter than 200 characters after stripping whitespace are excluded before embedding. These include empty READMEs, placeholder files, and monorepo pointers that lack sufficient content for meaningful embedding and map placement.
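The length filter amounts to the following (a minimal sketch, assuming a simple leading/trailing whitespace strip; field names are illustrative):

```python
MIN_README_CHARS = 200  # threshold used by the pipeline

def has_sufficient_readme(readme):
    """True if the README has enough content to embed meaningfully."""
    if not readme:
        return False
    return len(readme.strip()) >= MIN_README_CHARS

repos = [
    {"name": "example/placeholder", "readme": "TODO\n"},
    {"name": "example/real", "readme": "A real project description. " * 20},
]
kept = [r for r in repos if has_sufficient_readme(r["readme"])]  # placeholder dropped
```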
Each README is processed by Claude Haiku to extract five fields: a project title, a 2–3 sentence summary, a short tagline, a project type classification, and a target audience label. These feed into hover tooltips, search, colormap categories, and topic labeling. The raw README text (not the summaries) is what gets embedded in step 4 to drive map placement.
Processing Pipeline #
0. Candidate enumeration: Queries GH Archive on BigQuery for repositories with significant recent star activity, producing a generous candidate list (~25K repos). This step is optional; a pre-built candidate list is used automatically if it is skipped.
1. Repository fetch: Looks up each candidate via the GitHub GraphQL API (batched 25–50 per request), fetching metadata and README content, then sorts by stars and keeps the top 10,000.
2. Trim: Cuts the dataset to the top 10,000 repositories by star count, backing up the original file before trimming.
3. Summarization: Sends each README to Claude Haiku to extract a project title, short summary, tagline, project type, and target audience, used for hover tooltips, search, and colormap categories.
4. Embedding: Encodes README text into 512-dimensional vectors using Cohere's embed-v4.0 model.
5. Dimensionality reduction: Applies UMAP (n_neighbors=15, min_dist=0.05, metric="cosine") to project the 512-dimensional embeddings down to 2D coordinates for map placement.
6. Topic labeling: Uses the Toponymy library for hierarchical density-based clustering (min_clusters=4, lowest_detail_level=0.5, highest_detail_level=1.0), then sends representative documents from each cluster to Claude Sonnet to generate human-readable topic labels at multiple levels of detail.
7. Rendering: Combines coordinates, labels, and metadata into an interactive HTML map using DataMapPlot, with multiple colormaps, search, hover tooltips, and click-to-open functionality.
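The sort-and-trim at the heart of the early steps is simple; a minimal sketch (dictionary field names are illustrative, not the pipeline's actual schema):

```python
def top_n_by_stars(repos, n=10_000):
    """Sort repositories by star count, descending, and keep the top n."""
    return sorted(repos, key=lambda r: r["stars"], reverse=True)[:n]

sample = [
    {"name": "a", "stars": 120},
    {"name": "b", "stars": 9800},
    {"name": "c", "stars": 450},
]
top2 = top_n_by_stars(sample, n=2)  # keeps "b" and "c"
```

Fetching a generous candidate list and trimming locally is what sidesteps the Search API's 1,000-result cap: the ordering is computed deterministically from the fetched star counts rather than trusting a capped, non-deterministic search ranking.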
Tools & Technologies #
| Tool | Role |
|---|---|
| Cohere embed-v4.0 | README text embedding (512 dimensions) |
| UMAP | Dimensionality reduction from 512D to 2D |
| Toponymy | Hierarchical density-based topic labeling |
| DataMapPlot | Interactive HTML map rendering |
| Claude Haiku 4.5 | README summarization |
| Claude Sonnet 4 | Topic label generation |
| GitHub GraphQL API | Repository metadata and README fetching |
| BigQuery (GH Archive) | Candidate repository enumeration |
Notable Parameters #
Key parameter values used across the pipeline. These are the authoritative reference; some also appear inline in the step descriptions above.
| Parameter | Value | Notes |
|---|---|---|
| Corpus (steps 0–1) | | |
| BigQuery date range | 2022-01 to 2026-03 | GH Archive WatchEvent tables scanned |
| Min star events | 200 | Filters out repos with low recent activity |
| Candidate limit | 25,000 | BigQuery query LIMIT |
| Min README length | 200 chars | After stripping whitespace; shorter READMEs excluded |
| Summarization (step 3) | | |
| Model | Claude Haiku 4.5 | claude-haiku-4-5 |
| README truncation | 4,000 chars | Longer READMEs are truncated before sending to the model |
| Embedding (step 4) | | |
| Model | Cohere embed-v4.0 | embed-v4.0 |
| Dimensions | 512 | Output vector size |
| UMAP dimensionality reduction (step 5) | | |
| n_neighbors | 15 | UMAP local neighborhood size |
| min_dist | 0.05 | Controls tightness of clusters in 2D |
| metric | cosine | Distance metric |
| Topic labeling (step 6) | | |
| Model | Claude Sonnet 4 | claude-sonnet-4-20250514 |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | lowest_detail_level to highest_detail_level |
| Object description | "GitHub repository descriptions" | Passed to Toponymy LLM wrapper |
| Corpus description | "collection of the top 10,000 most-starred GitHub repositories" | Passed to Toponymy LLM wrapper |
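Collected as a single configuration object, the table reads as follows (the values are from the table above; the dictionary structure itself is illustrative, and the pipeline may organize these differently):

```python
PIPELINE_CONFIG = {
    "corpus": {
        "bigquery_date_range": ("2022-01", "2026-03"),
        "min_star_events": 200,
        "candidate_limit": 25_000,
        "min_readme_chars": 200,
    },
    "summarization": {
        "model": "claude-haiku-4-5",
        "readme_truncation_chars": 4_000,
    },
    "embedding": {
        "model": "embed-v4.0",
        "dimensions": 512,
    },
    "umap": {
        "n_neighbors": 15,
        "min_dist": 0.05,
        "metric": "cosine",
    },
    "topic_labeling": {
        "model": "claude-sonnet-4-20250514",
        "min_clusters": 4,
        "lowest_detail_level": 0.5,
        "highest_detail_level": 1.0,
    },
}
```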
LLM Prompts #
Summarization prompt (step 3)
Each README (truncated to 4,000 characters) is sent to Claude Haiku 4.5 with the following system prompt. The model returns a JSON object with five fields used for tooltips, search, and colormaps.
You are given the README of a GitHub repository. Return a JSON object with five fields:
- "title": The project's display name as presented in the README. If the README does not mention a project name, return null.
- "summary": A 2-3 sentence summary explaining what the project does, its key features, and what makes it notable. Focus on specifics that differentiate it from similar projects.
- "project_type": One of: "Library", "Framework", "CLI Tool", "Application", "Dataset", "Tutorial/Educational", "Collection/Awesome List", "Plugin/Extension", "API/Service", "Research", "Other".
- "tagline": A short noun-phrase label (3-7 words) identifying what the project *is* — a category/identity, not a feature list. Write for a technical audience. Bad: 'Modernized git CLI with suggestions and simplified workflows' Good: 'Modern git CLI wrapper'
- "target_audience": The primary audience for this project. One of: "Developers", "Data & ML Engineers", "DevOps & Infrastructure", "System Programmers", "Security Professionals", "End Users", "Learners & Educators", "Researchers". Choose the single best fit:
- "Developers": General software developers (web, mobile, desktop)
- "Data & ML Engineers": Data scientists, ML/AI practitioners
- "DevOps & Infrastructure": Cloud, containers, CI/CD, monitoring
- "System Programmers": OS, embedded, compilers, low-level systems
- "Security Professionals": Pentesting, crypto, vulnerability research
- "End Users": Non-developers who use the software directly
- "Learners & Educators": Students, tutorial followers, course creators
- "Researchers": Academic or scientific researchers
Respond with only the JSON object, no markdown fencing.
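Since the model may occasionally return values outside the allowed categories, the reply benefits from validation before use. A minimal sketch of parsing and sanity-checking the JSON — the fallback choices and the sample reply are illustrative, not the pipeline's actual behavior:

```python
import json

PROJECT_TYPES = {
    "Library", "Framework", "CLI Tool", "Application", "Dataset",
    "Tutorial/Educational", "Collection/Awesome List", "Plugin/Extension",
    "API/Service", "Research", "Other",
}
TARGET_AUDIENCES = {
    "Developers", "Data & ML Engineers", "DevOps & Infrastructure",
    "System Programmers", "Security Professionals", "End Users",
    "Learners & Educators", "Researchers",
}

def parse_summary(raw):
    """Parse the model's JSON reply, coercing out-of-vocabulary categories."""
    data = json.loads(raw)
    if data.get("project_type") not in PROJECT_TYPES:
        data["project_type"] = "Other"
    if data.get("target_audience") not in TARGET_AUDIENCES:
        data["target_audience"] = "Developers"  # illustrative fallback
    return data

reply = (
    '{"title": "ripgrep", "summary": "Fast regex search tool.", '
    '"project_type": "CLI Tool", "tagline": "Fast recursive grep", '
    '"target_audience": "Developers"}'
)
parsed = parse_summary(reply)
```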
Topic labeling (step 6)
Topic labels are generated by the Toponymy library, which constructs its own LLM prompts internally. The pipeline configures Toponymy with values that shape those prompts: the object_description is set to "GitHub repository descriptions" and the corpus_description to "collection of the top 10,000 most-starred GitHub repositories". Exemplar documents are delimited with triple-quoted Python docstring markers (`"""…"""`).

Toponymy generates labels at detail levels between 0.5 and 1.0, with the number of hierarchical layers determined automatically by the clustering algorithm (min_clusters=4). Claude Sonnet 4 serves as the LLM backend for generating human-readable topic names.
Using the Visualization #
- Pan and zoom: Click and drag to pan; scroll to zoom in and out.
- Hover: Hover over any point to see the project title, owner, tagline, project type, primary language, star and fork counts, and a short summary.
- Click: Click any point to open the repository on GitHub in a new tab.
- Search: Use the search box to find specific repositories by name.
- Filter panel: Open the sidebar to narrow the visible points with range sliders (stars, forks, open issues, created year, days since push) and checkbox filters (language, activity status, license family, project type, target audience, owner type).
- Colormaps: Switch between colormaps using the dropdown to color points by language, license family, project type, target audience, activity status, owner type, creation date, last push recency, star count, fork count, or open issues.
- Topic labels: Cluster labels appear on the map at multiple levels of detail, from broad categories down to specific sub-topics. The number of levels is chosen automatically by the labeling algorithm.
Field Definitions #
| Colormap | Description |
|---|---|
| Primary Language | The repository's primary programming language as reported by GitHub. The top 9 languages are shown individually; all others are grouped as "Other". |
| License Family | Licenses grouped into families: MIT, Apache, GPL, BSD, Creative Commons, MPL, Other Permissive, and Unknown/None. |
| Project Type | The type of project as classified by Claude Haiku: Library, Framework, CLI Tool, Application, Dataset, Tutorial/Educational, Collection/Awesome List, Plugin/Extension, API/Service, Research, or Other. |
| Target Audience | The intended audience as classified by Claude Haiku: Developers, Data & ML Engineers, DevOps & Infrastructure, System Programmers, Security Professionals, End Users, Learners & Educators, or Researchers. |
| Activity Status | Three categories based on archive status and recent activity: Active (not archived, pushed within the last 2 years), Inactive (not archived, no push in 2+ years), Archived (explicitly archived by the owner). |
| Owner Type | Whether the repository is owned by an individual User or an Organization account. |
| Created Date | The date the repository was created on GitHub. Older repos appear in darker tones; newer repos appear brighter. |
| Last Push (days, log10) | Base-10 logarithm of days since the most recent push. Green indicates recent activity; red indicates staleness. |
| Star Count (log10) | Base-10 logarithm of the repository's star count. Log scale helps distinguish differences among highly-starred repos. |
| Fork Count (log10) | Base-10 logarithm of the repository's fork count. Log scale helps distinguish differences among heavily-forked repos. |
| Open Issues (log10) | Base-10 logarithm of the repository's open issue count. Log scale normalizes the wide range of issue counts. |
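The derived fields above can be computed as follows (a sketch: the 730-day cutoff approximates the 2-year window, and clamping zero counts to 1 before the log is an assumption about how empty values are handled):

```python
import math
from datetime import datetime, timezone

def activity_status(archived, pushed_at, now=None):
    """Classify a repo as Archived, Active, or Inactive (2-year push window)."""
    if archived:
        return "Archived"
    now = now or datetime.now(timezone.utc)
    days_since_push = (now - pushed_at).days
    return "Active" if days_since_push <= 730 else "Inactive"

def log10_count(count):
    """Base-10 log used by the numeric colormaps; clamp 0 to avoid -inf."""
    return math.log10(max(count, 1))

now = datetime(2026, 3, 1, tzinfo=timezone.utc)
status = activity_status(False, datetime(2023, 1, 1, tzinfo=timezone.utc), now=now)
# status == "Inactive": last push was more than two years before the snapshot
```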
Prior Work #
The most notable prior work in this space is Andrei Kashcha's Map of GitHub (source), an impressive visualization that maps roughly 690,000 GitHub repositories onto a zoomable, globe-style interface. It remains one of the most ambitious public efforts to spatially organize the open-source ecosystem.
The core methodological difference is what signal drives map placement. Anvaka's map uses a collaborative-filtering approach: it computes Jaccard similarity over shared stargazers, so repositories that attract the same people end up near each other. This project instead uses a content-based signal: README text is embedded into high-dimensional vectors via Cohere, and repositories with semantically similar descriptions are placed nearby after UMAP reduction. The two approaches produce meaningfully different groupings: Anvaka's map clusters repos that appeal to the same audience, while this map clusters repos that describe similar functionality. They also differ in scale, with Anvaka's covering ~690K repos compared to 10K here.
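The two similarity signals can be contrasted concretely. A toy sketch with plain-Python implementations (for illustration only; neither project computes similarity exactly this way at scale):

```python
def jaccard(stargazers_a, stargazers_b):
    """Collaborative signal: overlap of two repos' stargazer sets."""
    return len(stargazers_a & stargazers_b) / len(stargazers_a | stargazers_b)

def cosine(u, v):
    """Content signal: angle between two repos' README embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return dot / (norm_u * norm_v)

# Two repos starred by mostly the same users: 2 shared out of 4 total.
overlap = jaccard({"alice", "bob", "carol"}, {"bob", "carol", "dave"})  # 0.5

# Two repos with similar README embeddings (toy 3-dimensional vectors).
similarity = cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
```

A repo pair can score high on one measure and low on the other: two competing web frameworks have near-identical READMEs (high cosine) but may attract disjoint communities (low Jaccard), which is why the two maps group things differently.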
There are several other notable differences. For layout, Anvaka uses Leiden community detection followed by force-directed graph layout, whereas this project applies UMAP dimensionality reduction directly on the embedding space. For rendering, Anvaka's map uses MapLibre GL with vector tiles for smooth globe-style navigation, while this project uses DataMapPlot to produce a self-contained interactive HTML file. For topic labeling, Anvaka assigns names to roughly 1,500 cluster "countries" using ChatGPT, while this project uses the Toponymy library to generate hierarchical multi-level labels via Claude Sonnet, producing labels at varying levels of detail rather than a single flat layer. Both are useful views into the same ecosystem.
At a broader level, this project belongs to a tradition of spatially mapping knowledge domains. Katy Börner's UCSD Map of Science, developed with her collaborators and described in her book Atlas of Science, mapped 7.2 million scientific publications across 554 subdisciplines using bibliometric citation networks. That work established many of the ideas this project draws on: that large corpora of documents can be meaningfully organized in two-dimensional space, and that the resulting maps can reveal structure that is difficult to see any other way.
More recently, Leland McInnes' ArXiv Data Map visualized 2.4 million ArXiv papers using an embed–reduce–label–render pipeline that is nearly identical to the one used here. That project served as a direct demonstration that the methodology (and the specific tools) could produce rich, navigable maps of large document collections. This project is essentially an application of the same approach to a different corpus.
Acknowledgements #
This project's pipeline is built on open-source tools created by Leland McInnes and collaborators at the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering. McInnes' ArXiv Data Map demonstrated that these tools, combined with modern text embeddings, could produce navigable, labeled maps of millions of documents. This project applies that same approach to the GitHub ecosystem.
Limitations #
This is a map of GitHub, not the map of GitHub. As with any map, the map is not the territory. It's a projection shaped by specific choices about what to measure, how to measure it, and how to render the result. To paraphrase George Box: all models are wrong, but some are useful. I think this map is useful, but it's worth understanding where it falls short.
- Embedding model dependence. The embedding model is the single biggest determinant of what "similarity" means on this map. Every downstream step (UMAP layout, clustering, topic labels) inherits the geometry it defines. Cohere's embed-v4.0 was trained on general web and document data, so it captures natural-language similarity well but may underweight signals a developer would consider important: code snippets in READMEs, architectural patterns, or dependency relationships. A different embedding model, especially a code-aware one, could produce a meaningfully different map from the same inputs.
- Selection bias. The map includes only the top 10,000 repositories by star count. Stars correlate with visibility, marketing, and English-language documentation, so the map systematically underrepresents non-English projects, niche domain-specific tools, and newer repositories that haven't yet accumulated stars.
- README-driven placement. Repositories are positioned by what their README says, not by what their code does. A repository with a sparse or misleading README will be misplaced, and READMEs shorter than 200 characters are excluded entirely. The embedding model's attention also isn't uniform across a long document; content near the beginning of a README likely has more influence on placement than content deeper in the file.
- Dimensionality reduction artifacts. UMAP does a good job preserving both local and global structure, but projecting 512 dimensions onto 2 inevitably loses information. Nearby points on the map genuinely have similar READMEs, and the relative positions of clusters are broadly meaningful, but inter-cluster distances should not be read as precise measurements of dissimilarity.
- LLM-generated content. Both the per-repo summaries (Claude Haiku) and the topic labels (Claude Sonnet via Toponymy) are generated by LLMs without human review. They can oversimplify, miscategorize, or occasionally hallucinate details that aren't in the source README.
- Point-in-time snapshot. The map reflects a single moment. Star counts, README content, activity status, and even repository existence all change over time. There's no mechanism for incremental updates; regenerating the map requires rerunning the full pipeline.
- Proprietary tooling. There is some irony in using proprietary, closed-source tools (Cohere's embedding model and Anthropic's LLMs) to build a map of notable open-source projects. These choices were made for quality and convenience, but they mean the pipeline can't be fully reproduced without access to commercial APIs, and the embedding and labeling behavior is opaque.
About #
- Author: Steven Fazzio
- Source code: stevenfazzio/semantic-github-map
- Feedback: Bug reports and feature requests are welcome. Please open an issue.
- License: MIT License