About

How the Semantic Map of the OEIS works

Data as of April 2026

Overview

This is a 2D map of 25,000 entries from the On-Line Encyclopedia of Integer Sequences (OEIS). Sequences with similar descriptive text appear near each other, and named themes (recurrence sequences, prime factorization, triangular arrays, and so on) are surfaced as cluster labels at multiple zoom levels.

The encyclopedia itself contains roughly 394,000 sequences as of April 2026. The map shows a 25,000-entry subset chosen for documentation richness. The rest of this page describes how the subset was selected, how the positions and labels were generated, and what the various colormaps in the visualization mean.

Which sequences are on the map

The OEIS contains everything from foundational sequences (Fibonacci, primes, Catalan numbers) to one-off submissions someone discovered last week. We wanted a map that surfaces well-documented, actively maintained sequences without drowning in the long tail. The 25,000 selected sequences come from combining two ideas:

Seed: editor-curated favorites. About 6,900 sequences carry the OEIS keywords core ("foundational") or nice ("exceptionally good"), assigned by editors to flag entries they think people should know about. All of these are included.

Topup: highest-quality scored remainder. The other ~18,100 are picked from the remaining ~387,000 by ranking on a composite quality score:

score = log1p(edit_count)
  + 1.0 · log1p(len(comments))
  + 0.5 · log1p(len(formulas))
  + 0.5 · log1p(len(examples))
  + 0.5 · len(code_languages)
  + 0.3 · log1p(n_references)
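
The formula above translates almost directly into code. The following is a minimal sketch, not the pipeline's actual implementation; the `entry` dict and its key names are hypothetical stand-ins for however the parsed OEIS record is stored.

```python
import math


def quality_score(entry: dict) -> float:
    """Composite quality score, term for term as in the formula above.

    `entry` is a hypothetical record; text fields are plain strings and
    `code_languages` is a collection of distinct language names.
    """
    return (
        math.log1p(entry["edit_count"])
        + 1.0 * math.log1p(len(entry["comments"]))
        + 0.5 * math.log1p(len(entry["formulas"]))
        + 0.5 * math.log1p(len(entry["examples"]))
        + 0.5 * len(entry["code_languages"])  # linear, not log-damped
        + 0.3 * math.log1p(entry["n_references"])
    )


# Two made-up entries, for illustration only.
sparse = {"edit_count": 2, "comments": "", "formulas": "", "examples": "",
          "code_languages": [], "n_references": 0}
rich = {"edit_count": 150, "comments": "x" * 4000, "formulas": "y" * 900,
        "examples": "z" * 300, "code_languages": ["PARI", "Python", "Maple"],
        "n_references": 12}

print(round(quality_score(sparse), 2), round(quality_score(rich), 2))
```

Note that the language count is the only term that is not log-damped, so each additional language adds a flat 0.5 while the text-length terms saturate.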

What each signal is, and why it's a quality signal:

Signal	What it measures
`edit_count`	Number of revisions the OEIS entry has accumulated. A sequence that gets repeatedly edited is one the community keeps returning to. Sustained edits are the closest thing the OEIS has to "this entry has owners."
`len(comments)`	Total characters in the prose comments section. Long comments mean a contributor took time to explain related results, alternative formulations, or historical context.
`len(formulas)`	Total characters in the formulas section. Multiple formulas typically come from multiple contributors, each having derived or restated the sequence in their preferred form.
`len(examples)`	Total characters in worked-example text. Examples are effort someone put in to make the sequence concrete and understandable.
`len(code_languages)`	Number of distinct programming languages with a contributed implementation (Maple, Mathematica, PARI, Python, Haskell, and so on). Independent reimplementations across languages mean the sequence interested people in several different communities, which makes this a genuinely strong signal.
`n_references`	Count of book and paper citations associated with the entry. A proxy for the sequence's academic footprint.
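
Putting the seed and top-up steps together, the selection could be sketched as below. This is an illustration under assumptions: the record layout (`keywords`, `values`, a precomputed `score`) is hypothetical, and applying the minimum-terms filter before the seed step is a guess about ordering. The exclusion keywords, the minimum of 8 terms, and the 25,000 target are the values listed under Notable parameters.

```python
HARD_EXCLUDE = {"dead", "dupe", "uned", "dumb"}  # excluded at all scopes
SEED_KEYWORDS = {"core", "nice"}                 # editor-curated favorites
MIN_TERMS = 8                                    # minimum visible terms


def select_curated(entries, target=25_000):
    """Return seed entries plus the highest-scoring remainder, up to `target`.

    Each entry is a hypothetical dict with `keywords` (list of str),
    `values` (list of int), and a precomputed composite `score`.
    """
    eligible = [
        e for e in entries
        if not (set(e["keywords"]) & HARD_EXCLUDE) and len(e["values"]) >= MIN_TERMS
    ]
    seed = [e for e in eligible if set(e["keywords"]) & SEED_KEYWORDS]
    rest = [e for e in eligible if not (set(e["keywords"]) & SEED_KEYWORDS)]
    rest.sort(key=lambda e: e["score"], reverse=True)
    topup = rest[: max(0, target - len(seed))]
    return seed + topup
```

Seed entries bypass the score entirely in this sketch, so a `core` sequence with a low score is still kept.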

The weights came from an empirical calibration, not just guesswork. We had a labeled negative class of about 2,400 programmatically-generated cellular-automaton bulk submissions: sequences uploaded en masse by an automated tool, which look superficially well-formed (they have comments, references, links) but are exactly the kind of entry the map shouldn't be cluttered with. We computed each signal's AUC at separating bulk from non-bulk and weighted accordingly.

Two signals turned out to be misleading and were excluded:

Hyperlink count (the number of %H entries) actually scored HIGHER for bulk submissions than for genuine sequences. Bulk uploads bundle templated links to the same handful of resources (Wolfram MathWorld, b-files, OEIS index pages); including this signal would have promoted bulk content rather than filtering it.
Has-code (binary). A simple "is there any implementation at all?" flag was nearly random for separating bulk from non-bulk. Replacing it with the COUNT of distinct languages made it useful: any one implementation is weak, but three or four independent implementations in different languages is robust evidence of interest.
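
To make the AUC comparison concrete, here is a small self-contained sketch. It computes AUC as the probability that a randomly chosen genuine entry outscores a randomly chosen bulk entry (ties count half). The numbers are invented toy data, not the project's measurements; they only illustrate why a binary flag can sit at chance while a count does not.

```python
def auc(genuine, bulk):
    """Probability that a random genuine score beats a random bulk score.

    0.5 means the signal is uninformative; below 0.5 means the signal
    ranks bulk entries above genuine ones.
    """
    wins = 0.0
    for g in genuine:
        for b in bulk:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(genuine) * len(bulk))


# Toy data (made up): number of distinct languages per entry.
genuine_lang_counts = [0, 1, 3, 4, 2]
bulk_lang_counts = [1, 1, 1, 0, 1]

# The binary "has any code" flag derived from the same counts.
genuine_has_code = [int(c > 0) for c in genuine_lang_counts]
bulk_has_code = [int(c > 0) for c in bulk_lang_counts]

print("has-code AUC:", auc(genuine_has_code, bulk_has_code))
print("language-count AUC:", auc(genuine_lang_counts, bulk_lang_counts))
```

Under this orientation, a signal that scores higher for bulk than for genuine entries, as described for hyperlink count, would come out below 0.5.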

How sequences are positioned

Each sequence is encoded as a 512-dimensional vector by Cohere's embed-v4.0 model. The vectors are then projected to 2D with UMAP, and the 2D position is what determines where each dot lands on the map.
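
The projection step might look like the sketch below. It assumes the embeddings are already available as an `(n_sequences, 512)` array and uses the `umap-learn` package; the parameter values are the ones listed under Notable parameters, while the function name and the lazy import are illustrative choices, not the pipeline's actual code.

```python
# UMAP settings as listed in the Notable parameters table.
UMAP_PARAMS = {
    "n_components": 2,      # 2D map coordinates
    "n_neighbors": 15,      # local neighborhood size
    "min_dist": 0.05,       # controls cluster tightness
    "metric": "cosine",
    "random_state": 42,     # reproducibility
}


def project_to_2d(embeddings):
    """Project an (n_sequences, 512) array down to (n_sequences, 2)."""
    import umap  # from the umap-learn package; imported lazily

    reducer = umap.UMAP(**UMAP_PARAMS)
    return reducer.fit_transform(embeddings)
```

Each row of the returned array is the (x, y) position of one dot on the map.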

The text fed to the encoder for each sequence concatenates:

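The sequence name (truncated to 300 characters)
The first ~15 numerical values
The comments section (truncated to 1,500 characters)
The first formula (truncated to 500 characters)
The worked-example text (truncated to 300 characters)
Editorial OEIS keywords like core, nice, easy, hard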

Most of the geometric work is done by the prose channels. For sequences with concept-bearing names like "Chebyshev polynomials of the first kind" or "Hexagonal pyramidal numbers", the conceptual tokens in the name and comments dominate placement: Cohere has seen these terms in lots of pretraining data and groups them coherently. For the few hundred famous sequences (Fibonacci, primes, Catalan, factorials), the leading-values prefix also acts as a uniquely-identifying signature, since something like 0, 1, 1, 2, 3, 5, 8, 13, 21 appears verbatim in many Fibonacci-related contexts in Cohere's training corpus. For the long tail of obscure sequences whose specific value prefixes Cohere hasn't memorized, the values channel is mostly opaque integers, and placement comes from whatever conceptual scaffolding the comments provide.

Some content keywords (tabl, mult, sign, and so on, describing what kind of sequence this is) are deliberately withheld from the embedded text, so they can serve as an independent signal for evaluating embedding quality. Editorial keywords that reflect editorial attention rather than mathematical content are kept.
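
As a rough illustration of how the embed text could be assembled, the sketch below concatenates the channels listed above and drops the withheld content keywords. The truncation budgets come from the Notable parameters table; the record field names, the newline separator, and the exact channel order are assumptions, and the withheld-keyword set is deliberately partial because only three examples are named here.

```python
BUDGETS = {"name": 300, "comments": 1500, "formula": 500, "example": 300}
N_VALUES = 15
# Partial list: only the three content keywords named in the text.
WITHHELD_KEYWORDS = {"tabl", "mult", "sign"}


def build_embed_text(entry: dict) -> str:
    """Concatenate the text channels for one sequence (hypothetical layout)."""
    kept_keywords = [k for k in entry["keywords"] if k not in WITHHELD_KEYWORDS]
    parts = [
        entry["name"][: BUDGETS["name"]],
        ", ".join(str(v) for v in entry["values"][:N_VALUES]),
        entry["comments"][: BUDGETS["comments"]],
        entry["first_formula"][: BUDGETS["formula"]],
        entry["examples"][: BUDGETS["example"]],
        " ".join(kept_keywords),
    ]
    # Skip empty channels so sparse entries do not produce blank lines.
    return "\n".join(p for p in parts if p)


example = {
    "name": "Fibonacci numbers",
    "values": [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987],
    "comments": "Also sometimes called Lamé's sequence.",
    "first_formula": "F(n) = F(n-1) + F(n-2)",
    "examples": "",
    "keywords": ["core", "nice", "nonn", "tabl"],
}
print(build_embed_text(example))
```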

We also experimented with retrofitting the embeddings against the OEIS %Y cross-reference graph (Faruqui et al., NAACL 2015, Laplacian smoothing). Once the embed text included worked examples, the smoothing traded content structure for graph-reconstruction fidelity without producing visibly better cluster names, so the published map is built from raw Cohere embeddings.
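
For readers unfamiliar with retrofitting, the core update is simple: each vector is repeatedly pulled toward the average of its graph neighbors while staying anchored to its original value. The sketch below uses the common default weighting (anchor weight 1, neighbor weights 1/degree); it is a generic illustration of the technique, not the code or settings used in the experiment, and the toy vectors and links are made up.

```python
def retrofit(vectors, edges, alpha=1.0, iterations=10):
    """Iteratively smooth vectors over a graph.

    vectors: dict id -> list of floats (original embeddings, left untouched)
    edges:   dict id -> list of neighbor ids (e.g. cross-references)
    """
    new = {k: list(v) for k, v in vectors.items()}
    for _ in range(iterations):
        for node, original in vectors.items():
            nbrs = [n for n in edges.get(node, []) if n in new]
            if not nbrs:
                continue  # isolated nodes keep their original vector
            beta = 1.0 / len(nbrs)  # neighbor weights sum to 1
            total = [alpha * x for x in original]
            for n in nbrs:
                for d, x in enumerate(new[n]):
                    total[d] += beta * x
            new[node] = [t / (alpha + 1.0) for t in total]
    return new


toy_vectors = {"A": [0.0, 0.0], "B": [2.0, 0.0], "C": [5.0, 5.0]}
toy_links = {"A": ["B"], "B": ["A"]}
print(retrofit(toy_vectors, toy_links, iterations=1))
```

Linked entries drift toward each other, which is the trade-off described above: graph fidelity improves, but the positions no longer reflect the text alone.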

What the cluster labels mean

The cluster labels overlaid on the map come from Toponymy, a hierarchical density-based clustering library. It groups dots by local density and asks Claude Sonnet 4.5 to invent a human-readable name for each cluster from a few representative sequences.

The result is a five-level hierarchy: at the broadest level there are 8 themes (recurrence sequences, prime factorization, triangular arrays, and so on); at the finest level there are 597 specific sub-themes. Different levels surface as you zoom in or out. About a third of sequences land in regions Toponymy can't find a clean theme for; those show up as Unlabelled.

Per-sequence tags

Beyond the spatial position and the cluster names, each sequence is tagged with four classification axes generated by Claude Sonnet 4.5 reading the OEIS entry text:

Math domain (12 values): number theory, combinatorics, analysis, algebra, graph theory, geometry, discrete dynamics, recreational, computer science, physics & chemistry, probability & stochastic, other.
Sequence type (9 values): what each term a(n) structurally represents (enumeration, arithmetic function, recurrence, closed form, constant digits, table flattened, characteristic, ranked list, other).
Growth class (9 values): how fast the sequence grows (finite, bounded, linear, polynomial, exponential, factorial-or-faster, logarithmic-or-subpoly, oscillating, unknown).
Origin era (5 values): when the underlying mathematical concept was first studied (classical pre-1900; early 20th century, 1900–1950; mid 20th century, 1950–2000; modern post-2000; unknown). This is the era of the idea, not the date the OEIS entry was created.

These tags appear in the hover card, are searchable, and drive most of the colormaps. They're LLM-generated without human review, so they can be wrong (see Limitations).
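
Because the tags are unreviewed model output, a cheap sanity check is to confirm every value belongs to the fixed taxonomy. The sketch below is one way to do that; the enum strings are copied from the stage 03 system prompt, while the `validate_tags` helper and the dict layout are hypothetical.

```python
TAXONOMY = {
    "math_domain": {
        "number_theory", "combinatorics", "algebra", "analysis", "geometry",
        "graph_theory", "discrete_dynamics", "recreational", "physics_chemistry",
        "computer_science", "probability_stochastic", "other",
    },
    "sequence_type": {
        "enumeration", "arithmetic_function", "recurrence", "closed_form",
        "constant_digits", "table_flattened", "characteristic", "ranked_list",
        "other",
    },
    "growth_class": {
        "finite", "bounded", "linear", "polynomial", "exponential",
        "factorial_or_faster", "logarithmic_or_subpoly", "oscillating", "unknown",
    },
    "origin_era": {
        "classical_pre1900", "early_20c_1900_1950", "mid_20c_1950_2000",
        "modern_post2000", "unknown",
    },
}


def validate_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means all four axes are valid."""
    problems = []
    for axis, allowed in TAXONOMY.items():
        value = tags.get(axis)
        if value not in allowed:
            problems.append(f"{axis}: unexpected value {value!r}")
    return problems


fibonacci = {
    "math_domain": "combinatorics",
    "sequence_type": "recurrence",
    "growth_class": "exponential",
    "origin_era": "classical_pre1900",
}
print(validate_tags(fibonacci))
```

A check like this only catches malformed values; it says nothing about whether a well-formed tag is the right one for the sequence.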

System prompt used by stage 03:

<task>
You classify Online Encyclopedia of Integer Sequences (OEIS) entries into a fixed
4-field taxonomy. For each sequence in the input, return one classification by
calling the classify_sequences tool. Use the values preview, name, formula,
comments, keywords, author, and last-edited year to inform your decision. The
Author field often carries explicit historical attribution (e.g., a name + year);
the LastEdited field is a weak hint about OEIS-entry recency, NOT original
authorship of the mathematical concept. Always pick a single best-fitting enum
value.
</task>

<math-domain>
Pick the SINGLE most-fitting mathematical area for the sequence:
- "number_theory" — primes, divisors, totient, modular arithmetic, Diophantine equations,
  polygonal/figurate numbers (triangular, square, pentagonal, k-gonal, lattice spirals)
- "combinatorics" — counting, partitions, permutations, lattice paths, set enumeration
- "algebra" — group/ring/field structure, polynomial sequences, algebraic invariants
- "analysis" — Taylor coefficients, integral transforms, special functions, decimal expansions
  of constants, modular forms / theta-eta quotients / q-series / Ramanujan-style power-series
  coefficients (these are coefficients of meromorphic functions, not "combinatorics")
- "geometry" — distances, areas, volumes, lattice points, polytopes, packings, actual
  geometric objects in space (NOT polygonal numbers — those are number_theory)
- "graph_theory" — graphs, trees, networks, colorings, matchings, automorphisms
- "discrete_dynamics" — iterated maps, cellular automata (including row/column/diagonal
  binary or decimal representations of CA growth), Collatz-like, recurrences over finite state
- "recreational" — puzzles, magic squares, palindromes, base-dependent curiosities, word play
- "physics_chemistry" — physical constants, chemical isomer counts, lattice/spin models
- "computer_science" — algorithm complexity, codes, automata theory, programming-language objects
  (NOT cellular automata — those are discrete_dynamics)
- "probability_stochastic" — random walks, branching processes, expected values, occupancy
- "other" — doesn't fit any of the above
</math-domain>

<sequence-type>
What does each term a(n) of the sequence represent?
- "enumeration" — counts of combinatorial structures parameterized by n (e.g., "number of partitions of n")
- "arithmetic_function" — value of a number-theoretic function at n (e.g., d(n), φ(n), σ(n))
- "recurrence" — defined by a recursive formula in earlier terms (e.g., a(n) = a(n-1) + a(n-2))
- "closed_form" — defined by an explicit closed-form expression in n
- "constant_digits" — the n-th digit (or term) of a real-valued constant's expansion
- "table_flattened" — a 2D triangle/table read by antidiagonals or rows (Pascal's triangle, Stirling numbers)
- "characteristic" — 1 if n has property P, else 0 (characteristic function of a set of integers)
- "ranked_list" — the n-th element of an enumeration of integers with some property (e.g., the n-th prime)
- "other" — none of the above

Boundary rules:
- Prefer "enumeration" over "arithmetic_function" if the sequence counts structures, even if it can be written as f(n).
- Prefer "ranked_list" over "characteristic" for sequences like "the prime numbers" (2, 3, 5, 7, …)
  where the indexing is over qualifying elements rather than over all integers.
- Prefer "table_flattened" over "enumeration" for triangular tables like Pascal's,
  even though each row counts something.
- Pure polynomial closed forms with NO combinatorial interpretation (e.g., "a(n) = 12*n^2 + 1",
  "a(n) = n*(6*n+4)", "a(n) = (9*n^2 - 3*n + 2)/2") are "closed_form", NOT "arithmetic_function"
  ("arithmetic_function" is reserved for well-known number-theoretic functions like d(n), φ(n),
  σ(n), ω(n), Ω(n)).
</sequence-type>

<growth-class>
How fast does a(n) grow with n? (Look at the values preview AND the formula.)
- "finite" — sequence has finitely many terms (look for "fini" or "full" in keywords)
- "bounded" — bounded above; doesn't grow (constant sequence, single-digit terms like π's expansion)
- "linear" — grows like cn or cn + d
- "polynomial" — grows like n^k for some k > 1
- "exponential" — grows like c · r^n for r > 1 (Fibonacci, 2^n, Catalan ~ 4^n)
- "factorial_or_faster" — grows like n! or faster (factorials, powers of factorials, Bell numbers)
- "logarithmic_or_subpoly" — grows slower than any polynomial (the n-th prime ~ n log n is here, NOT polynomial)
- "oscillating" — does not have a clear monotonic growth (alternating signs, periodic, ±1 patterns)
- "unknown" — growth rate not determinable from the data given

Boundary rule: for ranked-list sequences (n-th prime, n-th squarefree), classify by how the n-th
ELEMENT grows, not by how the count of qualifying integers grows. The n-th prime grows ~ n log n,
which is "logarithmic_or_subpoly" (not "polynomial").
</growth-class>

<origin-era>
When was the underlying mathematical concept first studied? (NOT when the OEIS entry was created.)
Use the name, comments, formulas, **Author**, and **LastEdited** year to infer.
- "classical_pre1900" — known before 1900 (Fibonacci, primes, Catalan, factorials, π digits, Pascal,
  Euler totient, Bernoulli numbers, polygonal numbers, ancient combinatorial puzzles)
- "early_20c_1900_1950" — first studied 1900–1950 (Ramanujan partitions, Hardy-Littlewood era,
  Polya enumeration, Bell numbers, early algebraic combinatorics)
- "mid_20c_1950_2000" — first studied 1950–2000 (most computer-era sequences, OEIS founding era,
  Conway's recreational sequences, early CA work, modern algorithmic combinatorics)
- "modern_post2000" — first studied after 2000 (21st-century OEIS contributions, modern combinatorics
  papers, recent CA enumeration, contemporary number-theoretic experiments)
- "unknown" — cannot determine

How to use the Author and LastEdited fields:
- If **Author** contains an explicit year (e.g., "_Wolfdieter Lang_, May 04 2018"), that year is
  when the OEIS entry was created — a strong upper bound but NOT necessarily the original era.
  Combine it with the name/comments: if the comments establish pre-1900 pedigree (e.g., "studied
  by Euler", "Leonardo of Pisa, 1202"), use the historical era; otherwise the explicit Author year
  is the best signal we have.
- If Author is "_N. J. A. Sloane_" with NO year, the entry comes from the OEIS founder's seed
  population and could be any era — fall back to name/comments analysis.
- LastEdited is the most-recent-edit year, NOT the original authorship date. Use it ONLY as a
  weak negative signal: a sequence with LastEdited 2020+, no historical pedigree in name/comments,
  and a modern Author name leans toward modern_post2000 or mid_20c_1950_2000.

Default: if the sequence is named after a pre-1900 mathematician or refers to an ancient object,
pick "classical_pre1900" regardless of Author/LastEdited dates. For obscure sequences with comments
referencing modern computer experiments and no clear historical pedigree, prefer "modern_post2000"
when the Author year is post-2000, else "mid_20c_1950_2000". When truly uncertain, use "unknown" —
but try to commit to an era when there is even weak evidence.
</origin-era>

<examples>
A000045 Fibonacci numbers: F(n) = F(n-1) + F(n-2), values 0, 1, 1, 2, 3, 5, 8, 13, 21, …
→ math_domain=combinatorics, sequence_type=recurrence, growth_class=exponential, origin_era=classical_pre1900
Reasoning: counts pairs in many bijections; classic two-term recurrence; golden ratio growth; Leonardo of Pisa, 1202.

A000040 The prime numbers: 2, 3, 5, 7, 11, 13, 17, 19, 23, …
→ math_domain=number_theory, sequence_type=ranked_list,
  growth_class=logarithmic_or_subpoly, origin_era=classical_pre1900
Reasoning: number theory's foundational object; the n-th prime grows ~ n log n (subpolynomial); studied since Euclid.

A000108 Catalan numbers: 1, 1, 2, 5, 14, 42, 132, 429, …
→ math_domain=combinatorics, sequence_type=enumeration, growth_class=exponential, origin_era=classical_pre1900
Reasoning: counts many structures (Dyck paths, binary trees, triangulations); 4^n / n^(3/2) growth; Catalan 1838.

A000796 Decimal expansion of Pi: 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, …
→ math_domain=analysis, sequence_type=constant_digits, growth_class=bounded, origin_era=classical_pre1900
Reasoning: digits of a transcendental constant; bounded between 0 and 9; π is studied since antiquity.

A007318 Pascal's triangle read by rows: 1, 1, 1, 1, 2, 1, 1, 3, 3, 1, 1, 4, 6, 4, 1, …
→ math_domain=combinatorics, sequence_type=table_flattened, growth_class=exponential, origin_era=classical_pre1900
Reasoning: a flattened triangular table of binomial coefficients; central column ~ 2^n / sqrt(n); Pascal 1654.
</examples>

Using the visualization

Pan and zoom: click and drag to pan, scroll to zoom.
Hover: see a card with id, edit count, reference count, name, leading formula, first values, the LLM-extracted math domain, sequence type, growth class, and origin era, the author, and the OEIS keywords assigned to the sequence (definitions on the OEIS keyword wiki).
Click: opens the sequence on oeis.org in a new tab.
Search: find sequences by id (e.g. A000045), name (e.g. Fibonacci), author, or any keyword that appears in the first 400 characters of comments.
Colormaps: the dropdown at lower-left switches the dot coloring between Toponymy clusters (default), the four LLM tag axes (math domain, sequence type, growth class, origin era), author, edit count, and reference count.
Marker size: bigger dots have higher edit count (a rough proxy for editorial attention). Famous sequences like Fibonacci and the primes are the largest dots on the map.
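
The marker sizing can be sketched as a square root followed by a linear stretch onto the 3 to 15 px range given under Notable parameters. Whether the stretch is anchored at the dataset minimum and maximum, as assumed here, is a guess; the function name is illustrative.

```python
import math


def marker_sizes(edit_counts, min_px=3.0, max_px=15.0):
    """Map edit counts to dot sizes: sqrt, then min-max stretch to [min_px, max_px]."""
    roots = [math.sqrt(c) for c in edit_counts]
    lo, hi = min(roots), max(roots)
    if hi == lo:
        # All entries have the same edit count; use the smallest size.
        return [min_px for _ in roots]
    return [min_px + (r - lo) / (hi - lo) * (max_px - min_px) for r in roots]


print(marker_sizes([0, 25, 100]))
```

The square root compresses the heavy tail, so a sequence with four times the edits gets roughly twice the size increase rather than four times.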

What each colormap means

Colormap	Description
Clusters (default)	Five hierarchical layers of Toponymy cluster names (8 / 16 / 49 / 173 / 597 unique labels, coarsest first). Cluster labels appear directly on the map and adapt to the current zoom level.
Math domain (LLM)	The single most-fitting mathematical area. The most populated buckets in the curated set are number theory and combinatorics.
Sequence type (LLM)	What each term structurally represents.
Growth class (LLM)	Asymptotic growth rate, classified from the leading values and the formula. Ranked-list sequences (the n-th prime, the n-th squarefree integer) are classified by how the n-th element grows, not by how the count of qualifying integers grows. By convention the n-th prime, which grows like n log n (faster than linear but slower than n^k for any k > 1), is filed under logarithmic-or-subpoly rather than polynomial.
Origin era (LLM)	When the underlying concept was first studied. The model uses the sequence name, comments, and author together; classical-pre1900 wins whenever the sequence is named after a pre-1900 mathematician or refers to an ancient object, regardless of when the OEIS entry was added.
Author	The primary author of the OEIS entry. Bucketed to the top 10 most-prolific contributors plus "Other" for legibility on the colormap; the hover card always shows the author's full name when it could be parsed from the `%A` line.
Edit count (log10)	Base-10 logarithm of the OEIS entry's revision count. Log scale spreads out the long tail.
References (log10 + 1)	Base-10 logarithm of (1 + number of book and paper citations).
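
The three non-LLM colormaps are simple transforms of per-sequence metadata. A minimal sketch, with hypothetical function names; clamping the edit count to at least 1 before taking the logarithm is an assumption added to avoid log10(0), not something stated in the table.

```python
import math
from collections import Counter


def author_buckets(authors, top_n=10):
    """Keep the `top_n` most prolific authors; relabel everyone else "Other"."""
    counts = Counter(authors)
    top = {name for name, _ in counts.most_common(top_n)}
    return [a if a in top else "Other" for a in authors]


def edit_count_color(edit_count):
    """Base-10 log of the revision count (clamped to 1, an assumption)."""
    return math.log10(max(edit_count, 1))


def reference_color(n_references):
    """Base-10 log of (1 + number of citations), so zero citations maps to 0."""
    return math.log10(1 + n_references)


print(author_buckets(["a", "a", "b", "c"], top_n=1))
```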

Each sequence's full OEIS keyword list (core, nice, easy, tabl, mult, nonn, and so on) appears at the bottom of the hover card. Definitions are on the OEIS keyword wiki.

Notable parameters

Key configuration values used across the pipeline. Some appear inline in the descriptions above; the table below collects them in one place as the authoritative reference.

Parameter	Value	Notes
Corpus
Source	`oeis/oeisdata`	Sparse checkout via `scripts/sync_seq.sh`
Total sequences upstream	394,561	As of April 2026
Curated target	25,000	The published map
Hard-exclude keywords	`dead`, `dupe`, `uned`, `dumb`	Applied at all scopes
Min visible terms	8	Sequences with fewer terms are excluded
Curated seed	`core` ∪ `nice`	~6,900 sequences
Embedding text budgets (per sequence, stage 04)
Name truncation	300 chars
Comment truncation	1,500 chars
Formula truncation	500 chars
Example truncation	300 chars
Values shown	15	Leading numeric terms kept for embedding hint and tooltip
LLM classification prompt budgets (stage 03)
Comment truncation	800 chars	Tighter than the embed-side budget; prompt economy
Formula truncation	400 chars
Author truncation	100 chars
LLM classification (stage 03)
Model	Claude Sonnet 4.5	`claude-sonnet-4-5`
Batch size	25 sequences	Per tool-use call
Concurrency	30	Async semaphore
Max retries	5	Exponential backoff on rate-limit and API errors
Embedding (stage 04)
Model	Cohere `embed-v4.0`	`input_type="clustering"`
Dimensions	512
Batch size	96	Cohere API call
Content keywords	excluded from embed text	So they remain an orthogonal eval signal
UMAP (stage 05)
`n_neighbors`	15	Local neighborhood size
`min_dist`	0.05	Controls cluster tightness
Metric	`cosine`
`random_state`	42	Reproducibility
Topic labeling (stage 06)
Model	Claude Sonnet 4.5	Toponymy `AsyncAnthropicNamer`
`min_clusters`	4	Toponymy clusterer minimum
Detail levels	0.5–1.0
Object description	"OEIS integer sequences"
Layers (curated)	5	8 / 16 / 49 / 173 / 597 unique labels coarsest→finest
Visualization (stage 07)
Marker size	sqrt(edit_count) → [3, 15] px	Linear stretch; high-edit sequences are the largest dots
Author bucketing	Top 10 + "Other"	Top by per-author sequence count
Click handler	`oeis.org/{id}`	Opens in a new tab
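
The stage 03 settings in the table (batches of 25, concurrency 30, up to 5 retries with exponential backoff) fit a standard asyncio pattern. The sketch below shows that pattern with a caller-supplied `classify` coroutine standing in for the real API call; the function names, the base delay, the choice to catch every exception, and reading "max retries 5" as five retries after the first attempt are all assumptions.

```python
import asyncio

BATCH_SIZE = 25    # sequences per tool-use call
CONCURRENCY = 30   # async semaphore size
MAX_RETRIES = 5    # retries after the first attempt (an interpretation)


def batched(items, size=BATCH_SIZE):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def call_with_retries(classify, batch, semaphore, base_delay=1.0):
    """Run one batch under the semaphore, backing off exponentially on failure."""
    async with semaphore:
        for attempt in range(MAX_RETRIES + 1):
            try:
                return await classify(batch)
            except Exception:  # stand-in for rate-limit and API errors
                if attempt == MAX_RETRIES:
                    raise
                await asyncio.sleep(base_delay * 2 ** attempt)


async def classify_all(classify, sequences, base_delay=1.0):
    """Classify every sequence; results come back in input order."""
    semaphore = asyncio.Semaphore(CONCURRENCY)
    tasks = [
        call_with_retries(classify, batch, semaphore, base_delay)
        for batch in batched(sequences)
    ]
    results = await asyncio.gather(*tasks)
    return [tag for batch_result in results for tag in batch_result]
```

In this sketch a batch keeps its semaphore slot while it backs off, which throttles the whole run when the API is rate-limiting; releasing the slot during the sleep would be an equally reasonable design.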

Limitations

This is one map of the OEIS, not the only possible one. The map is not the territory: it's a projection shaped by specific choices about what to measure and how to render the result. To paraphrase George Box: all models are wrong, but some are useful. We think this map is useful, but it's worth understanding where it falls short.

The 25k filter excludes most of the OEIS. The score over-represents sequences that have attracted editorial attention. A mathematically interesting but sparsely-documented sequence will be excluded; conversely a heavily-edited but niche entry can be included.
Cohere is not math-tuned. The encoder was trained on general web and document text. It's good at coarse semantic similarity (it can tell that two number-theory entries are about number theory) but it has no special understanding of mathematical structure. A sequence whose prose is generic and whose values aren't in the well-known canon can land somewhere odd. A math-aware encoder would likely sharpen the geometry meaningfully.
The numerical values are embedded only as text. A textual rendering of the first ~15 terms is all that participates in the encoding; the full value lists, b-files, and any numerical structure beyond what the prose describes are invisible to the layout.
UMAP artifacts. Projecting 512 dimensions onto 2 inevitably loses information. Nearby points genuinely have similar embeddings, and the relative positions of clusters are broadly meaningful, but inter-cluster distances should not be read as precise measurements of dissimilarity.
LLM-generated tags can be wrong. The four classification axes and the cluster names are produced by Sonnet without human review. Some real distinctions (specific sequence families, particular author lineages, sub-genres of recreational mathematics) are not captured by any field.
Point in time. The OEIS adds and edits sequences daily; this map is a snapshot. There is no incremental-update mechanism, so refreshing requires a full re-run.
Proprietary tooling. There is some irony in using closed-source models (Cohere's embedder and Anthropic's Claude) to map one of the longest-running open mathematical resources on the internet. These were chosen for quality and convenience but they mean the pipeline can't be fully reproduced without commercial-API access, and the embedding and labeling behavior is opaque.

Credits

Sequence data is from the OEIS Foundation and the thousands of contributors whose work fills the encyclopedia. Embeddings are from Cohere; LLM tagging and cluster naming use Anthropic Claude Sonnet 4.5. The pipeline depends on open-source tools from the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering.

Author: Steven Fazzio
Source code: stevenfazzio/oeisdata-map
Sister projects: semantic-github-map (top 10k starred GitHub repositories) and huggingface-dataset-map (top 5k liked HuggingFace datasets).
Feedback: bug reports welcome. Please open an issue.
License: pipeline code is MIT-licensed; sequence data is governed by the upstream OEIS license bundled in the source repository's LICENSE file.