About

How the Semantic Map of the OEIS works

Data as of April 2026

Overview

This is a 2D map of 25,000 entries from the On-Line Encyclopedia of Integer Sequences (OEIS). Sequences with similar descriptive text appear near each other, and named themes (recurrence sequences, prime factorization, triangular arrays, and so on) are surfaced as cluster labels at multiple zoom levels.

The encyclopedia itself contains roughly 394,000 sequences as of April 2026. The map shows a 25,000-entry subset chosen for documentation richness. The rest of this page describes how the subset was selected, how the positions and labels were generated, and what the various colormaps in the visualization mean.

Pipeline: 394k OEIS sequences → Curated 25k → LLM tag (4 axes) → Embed (512D) → UMAP to 2D → Cluster & label → Interactive map

Which sequences are on the map

The OEIS contains everything from foundational sequences (Fibonacci, primes, Catalan numbers) to one-off submissions someone discovered last week. We wanted a map that surfaces well-documented, actively maintained sequences without drowning in the long tail. The 25,000 selected sequences come from combining two ideas:

Seed: editor-curated favorites. About 6,900 sequences carry the OEIS keywords core ("foundational") or nice ("exceptionally good"), assigned by editors to flag entries they think people should know about. All of these are included.

Topup: highest-quality scored remainder. The other ~18,100 are picked from the remaining ~387,000 by ranking on a composite quality score:

score = log1p(edit_count)
  + 1.0 · log1p(len(comments))
  + 0.5 · log1p(len(formulas))
  + 0.5 · log1p(len(examples))
  + 0.5 · len(code_languages)
  + 0.3 · log1p(n_references)

What each signal is, and why it's a quality signal:

| Signal | What it measures |
| --- | --- |
| edit_count | Number of revisions the OEIS entry has accumulated. A sequence that gets repeatedly edited is one the community keeps returning to. Sustained edits are the closest thing the OEIS has to "this entry has owners." |
| len(comments) | Total characters in the prose comments section. Long comments mean a contributor took time to explain related results, alternative formulations, or historical context. |
| len(formulas) | Total characters in the formulas section. Multiple formulas typically come from multiple contributors, each having derived or restated the sequence in their preferred form. |
| len(examples) | Total characters in worked-example text. Examples are effort someone put in to make the sequence concrete and understandable. |
| len(code_languages) | Number of distinct programming languages with a contributed implementation (Maple, Mathematica, PARI, Python, Haskell, and so on). Independent reimplementations across languages mean the sequence interested people in several different communities. Genuinely strong signal. |
| n_references | Count of book and paper citations associated with the entry. Academic footprint. |
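As a minimal sketch, the composite score can be written as a single Python function that follows the formula above term by term. The function name, argument names, and the two example entries are illustrative assumptions, not the pipeline's actual code.

```python
from math import log1p


def quality_score(edit_count, comments, formulas, examples,
                  code_languages, n_references):
    """Composite quality score, mirroring the formula in the text.

    comments, formulas, examples: raw section text (length in characters).
    code_languages: collection of distinct language names.
    """
    return (
        log1p(edit_count)
        + 1.0 * log1p(len(comments))
        + 0.5 * log1p(len(formulas))
        + 0.5 * log1p(len(examples))
        + 0.5 * len(code_languages)
        + 0.3 * log1p(n_references)
    )


# Two made-up entries: one richly documented, one nearly empty.
rich = quality_score(120, "x" * 2000, "y" * 600, "z" * 300,
                     {"Maple", "PARI", "Python"}, 8)
bare = quality_score(2, "", "", "", set(), 0)
print(f"rich={rich:.2f} bare={bare:.2f}")
```

Note that the language count enters linearly while every other signal is log-compressed, so each additional implementation language adds a fixed 0.5 to the score.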

The weights came from an empirical calibration, not just guesswork. We had a labeled negative class of about 2,400 programmatically generated cellular-automaton bulk submissions: sequences uploaded en masse by an automated tool, which look superficially well-formed (they have comments, references, links) but are exactly the kind of entry the map shouldn't be cluttered with. We computed each signal's AUC for separating bulk from non-bulk entries and set the weights accordingly.
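To make the calibration step concrete, here is a small sketch of a per-signal AUC: the probability that a randomly chosen non-bulk entry has a higher signal value than a randomly chosen bulk entry, with ties counted as half. The signal names and the toy values are made up for illustration, and the mapping from AUC to the final weights is not specified on this page, so it is not reproduced here.

```python
def auc(non_bulk_values, bulk_values):
    """Pairwise AUC: P(non-bulk value > bulk value), ties count 0.5."""
    wins = 0.0
    for x in non_bulk_values:
        for y in bulk_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(non_bulk_values) * len(bulk_values))


# Toy data (hypothetical): (non-bulk values, bulk values) per signal.
signals = {
    "edit_count": ([40, 12, 75, 9], [3, 5, 2, 4]),
    "weak_signal": ([2, 5, 1, 4], [3, 5, 2, 4]),
}
aucs = {name: auc(pos, neg) for name, (pos, neg) in signals.items()}
print(aucs)
```

An AUC near 1.0 means the signal separates the two classes cleanly; an AUC near 0.5 means it carries almost no information about bulk versus non-bulk.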

Two signals turned out to be misleading and were excluded:

How sequences are positioned

Each sequence is encoded as a 512-dimensional vector by Cohere's embed-v4.0 model. The vectors are then projected to 2D with UMAP, and the 2D position is what determines where each dot lands on the map.

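The text fed to the encoder for each sequence concatenates several channels, including the name, comments, formulas, worked examples, and the leading values, each truncated to the budgets listed under Notable parameters.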

Most of the geometric work is done by the prose channels. For sequences with concept-bearing names like "Chebyshev polynomials of the first kind" or "Hexagonal pyramidal numbers", the conceptual tokens in the name and comments dominate placement: Cohere has seen these terms in lots of pretraining data and groups them coherently. For the few hundred famous sequences (Fibonacci, primes, Catalan, factorials), the leading-values prefix also acts as a uniquely-identifying signature, since something like 0, 1, 1, 2, 3, 5, 8, 13, 21 appears verbatim in many Fibonacci-related contexts in Cohere's training corpus. For the long tail of obscure sequences whose specific value prefixes Cohere hasn't memorized, the values channel is mostly opaque integers, and placement comes from whatever conceptual scaffolding the comments provide.

Some content keywords (tabl, mult, sign, and so on, describing what kind of sequence this is) are deliberately withheld from the embedded text, so they can serve as an independent signal for evaluating embedding quality. Editorial keywords that reflect editorial attention rather than mathematical content are kept.
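A rough sketch of how such an embed text could be assembled is below. The truncation budgets and the 15 leading values come from the parameters table on this page; the field order, the line labels, the dictionary layout of an entry, and the exact set of withheld content keywords (only the three named above are listed) are assumptions.

```python
# Character budgets per field, taken from the parameters table.
BUDGETS = {"name": 300, "comments": 1500, "formulas": 500, "examples": 300}
N_VALUES = 15
# Assumption: only the keywords named in the text; the full set is not given.
CONTENT_KEYWORDS = {"tabl", "mult", "sign"}


def build_embed_text(entry):
    """Concatenate truncated text fields, leading values, and kept keywords."""
    parts = []
    for field, budget in BUDGETS.items():
        text = entry.get(field, "").strip()
        if text:
            parts.append(f"{field}: {text[:budget]}")
    values = entry.get("values", [])[:N_VALUES]
    if values:
        parts.append("values: " + ", ".join(str(v) for v in values))
    # Content keywords are withheld; editorial ones (core, nice) stay in.
    kept = [k for k in entry.get("keywords", []) if k not in CONTENT_KEYWORDS]
    if kept:
        parts.append("keywords: " + ", ".join(kept))
    return "\n".join(parts)


# Hypothetical entry used only to exercise the function.
entry = {
    "name": "A" * 400,
    "comments": "c",
    "values": list(range(20)),
    "keywords": ["core", "tabl", "nice"],
}
text = build_embed_text(entry)
print(text)
```

Keeping the withheld keywords in a single set makes it easy to check afterwards that none of them leaked into the embedded text.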

We also experimented with retrofitting the embeddings against the OEIS %Y cross-reference graph (Faruqui et al., NAACL 2015, Laplacian smoothing). Once the embed text included worked examples, the smoothing traded content structure for graph-reconstruction fidelity without producing visibly better cluster names, so the published map is built from raw Cohere embeddings.
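For readers unfamiliar with retrofitting, the sketch below shows the iterative update from Faruqui et al.: each vector is repeatedly replaced by a weighted average of its original embedding and the current vectors of its graph neighbors. The weights (alpha = 1, neighbor weights of 1/degree), the 10 iterations, and the toy data are the defaults commonly used with that method and are assumptions here, not the settings of the experiment described above.

```python
def retrofit(embeddings, neighbors, alpha=1.0, iterations=10):
    """Pull each vector toward its graph neighbors while anchoring it
    to its original embedding. Returns a new dict of vectors."""
    original = {k: list(v) for k, v in embeddings.items()}
    current = {k: list(v) for k, v in embeddings.items()}
    for _ in range(iterations):
        for node, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in current]
            if node not in current or not nbrs:
                continue  # isolated nodes keep their original vector
            beta = 1.0 / len(nbrs)  # neighbor weights sum to 1
            current[node] = [
                (alpha * original[node][d]
                 + beta * sum(current[n][d] for n in nbrs)) / (alpha + 1.0)
                for d in range(len(original[node]))
            ]
    return current


# Toy 2D vectors and a single cross-reference edge (hypothetical IDs).
vectors = {"seq_a": [1.0, 0.0], "seq_b": [0.0, 1.0], "seq_c": [5.0, 5.0]}
xrefs = {"seq_a": ["seq_b"], "seq_b": ["seq_a"]}
smoothed = retrofit(vectors, xrefs)
print(smoothed)
```

The two linked vectors move toward each other while the unlinked one stays put, which is the trade described above: graph fidelity goes up, and some of the content-driven geometry is given away.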

What the cluster labels mean

The cluster labels overlaid on the map come from Toponymy, a hierarchical density-based clustering library. It groups dots by local density and asks Claude Sonnet 4.5 to invent a human-readable name for each cluster from a few representative sequences.

The result is a five-level hierarchy: at the broadest level there are 8 themes (recurrence sequences, prime factorization, triangular arrays, and so on); at the finest level there are 597 specific sub-themes. Different levels surface as you zoom in or out. About a third of sequences land in regions Toponymy can't find a clean theme for; those show up as Unlabelled.

Per-sequence tags

Beyond the spatial position and the cluster names, each sequence is tagged along four classification axes (math domain, sequence type, growth class, and origin era), generated by Claude Sonnet 4.5 from the OEIS entry text.

These tags appear in the hover card, are searchable, and drive most of the colormaps. They're LLM-generated without human review, so they can be wrong (see Limitations).

System prompt used by stage 03:
<task>
You classify Online Encyclopedia of Integer Sequences (OEIS) entries into a fixed
4-field taxonomy. For each sequence in the input, return one classification by
calling the classify_sequences tool. Use the values preview, name, formula,
comments, keywords, author, and last-edited year to inform your decision. The
Author field often carries explicit historical attribution (e.g., a name + year);
the LastEdited field is a weak hint about OEIS-entry recency, NOT original
authorship of the mathematical concept. Always pick a single best-fitting enum
value.
</task>

<math-domain>
Pick the SINGLE most-fitting mathematical area for the sequence:
- "number_theory" — primes, divisors, totient, modular arithmetic, Diophantine equations,
  polygonal/figurate numbers (triangular, square, pentagonal, k-gonal, lattice spirals)
- "combinatorics" — counting, partitions, permutations, lattice paths, set enumeration
- "algebra" — group/ring/field structure, polynomial sequences, algebraic invariants
- "analysis" — Taylor coefficients, integral transforms, special functions, decimal expansions
  of constants, modular forms / theta-eta quotients / q-series / Ramanujan-style power-series
  coefficients (these are coefficients of meromorphic functions, not "combinatorics")
- "geometry" — distances, areas, volumes, lattice points, polytopes, packings, actual
  geometric objects in space (NOT polygonal numbers — those are number_theory)
- "graph_theory" — graphs, trees, networks, colorings, matchings, automorphisms
- "discrete_dynamics" — iterated maps, cellular automata (including row/column/diagonal
  binary or decimal representations of CA growth), Collatz-like, recurrences over finite state
- "recreational" — puzzles, magic squares, palindromes, base-dependent curiosities, word play
- "physics_chemistry" — physical constants, chemical isomer counts, lattice/spin models
- "computer_science" — algorithm complexity, codes, automata theory, programming-language objects
  (NOT cellular automata — those are discrete_dynamics)
- "probability_stochastic" — random walks, branching processes, expected values, occupancy
- "other" — doesn't fit any of the above
</math-domain>

<sequence-type>
What does each term a(n) of the sequence represent?
- "enumeration" — counts of combinatorial structures parameterized by n (e.g., "number of partitions of n")
- "arithmetic_function" — value of a number-theoretic function at n (e.g., d(n), φ(n), σ(n))
- "recurrence" — defined by a recursive formula in earlier terms (e.g., a(n) = a(n-1) + a(n-2))
- "closed_form" — defined by an explicit closed-form expression in n
- "constant_digits" — the n-th digit (or term) of a real-valued constant's expansion
- "table_flattened" — a 2D triangle/table read by antidiagonals or rows (Pascal's triangle, Stirling numbers)
- "characteristic" — 1 if n has property P, else 0 (characteristic function of a set of integers)
- "ranked_list" — the n-th element of an enumeration of integers with some property (e.g., the n-th prime)
- "other" — none of the above

Boundary rules:
- Prefer "enumeration" over "arithmetic_function" if the sequence counts structures, even if it can be written as f(n).
- Prefer "ranked_list" over "characteristic" for sequences like "the prime numbers" (2, 3, 5, 7, …)
  where the indexing is over qualifying elements rather than over all integers.
- Prefer "table_flattened" over "enumeration" for triangular tables like Pascal's,
  even though each row counts something.
- Pure polynomial closed forms with NO combinatorial interpretation (e.g., "a(n) = 12*n^2 + 1",
  "a(n) = n*(6*n+4)", "a(n) = (9*n^2 - 3*n + 2)/2") are "closed_form", NOT "arithmetic_function"
  ("arithmetic_function" is reserved for well-known number-theoretic functions like d(n), φ(n),
  σ(n), ω(n), Ω(n)).
</sequence-type>

<growth-class>
How fast does a(n) grow with n? (Look at the values preview AND the formula.)
- "finite" — sequence has finitely many terms (look for "fini" or "full" in keywords)
- "bounded" — bounded above; doesn't grow (constant sequence, single-digit terms like π's expansion)
- "linear" — grows like cn or cn + d
- "polynomial" — grows like n^k for some k > 1
- "exponential" — grows like c · r^n for r > 1 (Fibonacci, 2^n, Catalan ~ 4^n)
- "factorial_or_faster" — grows like n! or faster (factorials, powers of factorials, Bell numbers)
- "logarithmic_or_subpoly" — grows slower than any polynomial (the n-th prime ~ n log n is here, NOT polynomial)
- "oscillating" — does not have a clear monotonic growth (alternating signs, periodic, ±1 patterns)
- "unknown" — growth rate not determinable from the data given

Boundary rule: for ranked-list sequences (n-th prime, n-th squarefree), classify by how the n-th
ELEMENT grows, not by how the count of qualifying integers grows. The n-th prime grows ~ n log n,
which is "logarithmic_or_subpoly" (not "polynomial").
</growth-class>

<origin-era>
When was the underlying mathematical concept first studied? (NOT when the OEIS entry was created.)
Use the name, comments, formulas, **Author**, and **LastEdited** year to infer.
- "classical_pre1900" — known before 1900 (Fibonacci, primes, Catalan, factorials, π digits, Pascal,
  Euler totient, Bernoulli numbers, polygonal numbers, ancient combinatorial puzzles)
- "early_20c_1900_1950" — first studied 1900–1950 (Ramanujan partitions, Hardy-Littlewood era,
  Polya enumeration, Bell numbers, early algebraic combinatorics)
- "mid_20c_1950_2000" — first studied 1950–2000 (most computer-era sequences, OEIS founding era,
  Conway's recreational sequences, early CA work, modern algorithmic combinatorics)
- "modern_post2000" — first studied after 2000 (21st-century OEIS contributions, modern combinatorics
  papers, recent CA enumeration, contemporary number-theoretic experiments)
- "unknown" — cannot determine

How to use the Author and LastEdited fields:
- If **Author** contains an explicit year (e.g., "_Wolfdieter Lang_, May 04 2018"), that year is
  when the OEIS entry was created — a strong upper bound but NOT necessarily the original era.
  Combine it with the name/comments: if the comments establish pre-1900 pedigree (e.g., "studied
  by Euler", "Leonardo of Pisa, 1202"), use the historical era; otherwise the explicit Author year
  is the best signal we have.
- If Author is "_N. J. A. Sloane_" with NO year, the entry comes from the OEIS founder's seed
  population and could be any era — fall back to name/comments analysis.
- LastEdited is the most-recent-edit year, NOT the original authorship date. Use it ONLY as a
  weak negative signal: a sequence with LastEdited 2020+, no historical pedigree in name/comments,
  and a modern Author name leans toward modern_post2000 or mid_20c_1950_2000.

Default: if the sequence is named after a pre-1900 mathematician or refers to an ancient object,
pick "classical_pre1900" regardless of Author/LastEdited dates. For obscure sequences with comments
referencing modern computer experiments and no clear historical pedigree, prefer "modern_post2000"
when the Author year is post-2000, else "mid_20c_1950_2000". When truly uncertain, use "unknown" —
but try to commit to an era when there is even weak evidence.
</origin-era>

<examples>
A000045 Fibonacci numbers: F(n) = F(n-1) + F(n-2), values 0, 1, 1, 2, 3, 5, 8, 13, 21, …
→ math_domain=combinatorics, sequence_type=recurrence, growth_class=exponential, origin_era=classical_pre1900
Reasoning: counts pairs in many bijections; classic two-term recurrence; golden ratio growth; Leonardo of Pisa, 1202.

A000040 The prime numbers: 2, 3, 5, 7, 11, 13, 17, 19, 23, …
→ math_domain=number_theory, sequence_type=ranked_list,
  growth_class=logarithmic_or_subpoly, origin_era=classical_pre1900
Reasoning: number theory's foundational object; the n-th prime grows ~ n log n (subpolynomial); studied since Euclid.

A000108 Catalan numbers: 1, 1, 2, 5, 14, 42, 132, 429, …
→ math_domain=combinatorics, sequence_type=enumeration, growth_class=exponential, origin_era=classical_pre1900
Reasoning: counts many structures (Dyck paths, binary trees, triangulations); 4^n / n^(3/2) growth; Catalan 1838.

A000796 Decimal expansion of Pi: 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, …
→ math_domain=analysis, sequence_type=constant_digits, growth_class=bounded, origin_era=classical_pre1900
Reasoning: digits of a transcendental constant; bounded between 0 and 9; π is studied since antiquity.

A007318 Pascal's triangle read by rows: 1, 1, 1, 1, 2, 1, 1, 3, 3, 1, 1, 4, 6, 4, 1, …
→ math_domain=combinatorics, sequence_type=table_flattened, growth_class=exponential, origin_era=classical_pre1900
Reasoning: a flattened triangular table of binomial coefficients; central column ~ 2^n / sqrt(n); Pascal 1654.
</examples>
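Because the tags are LLM-generated without human review, one cheap safeguard is to check every returned record against the fixed taxonomy. The sketch below does that using the enum values and field names from the prompt above; the validator itself and the record layout are assumptions, not part of the published pipeline.

```python
# Allowed values per field, copied from the prompt's four sections.
TAXONOMY = {
    "math_domain": {
        "number_theory", "combinatorics", "algebra", "analysis", "geometry",
        "graph_theory", "discrete_dynamics", "recreational",
        "physics_chemistry", "computer_science", "probability_stochastic",
        "other",
    },
    "sequence_type": {
        "enumeration", "arithmetic_function", "recurrence", "closed_form",
        "constant_digits", "table_flattened", "characteristic",
        "ranked_list", "other",
    },
    "growth_class": {
        "finite", "bounded", "linear", "polynomial", "exponential",
        "factorial_or_faster", "logarithmic_or_subpoly", "oscillating",
        "unknown",
    },
    "origin_era": {
        "classical_pre1900", "early_20c_1900_1950", "mid_20c_1950_2000",
        "modern_post2000", "unknown",
    },
}


def validate_classification(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, allowed in TAXONOMY.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not an allowed value")
    return problems


# The A000045 example from the prompt, plus a deliberately broken record.
fibonacci = {
    "math_domain": "combinatorics",
    "sequence_type": "recurrence",
    "growth_class": "exponential",
    "origin_era": "classical_pre1900",
}
broken = dict(fibonacci, growth_class="superlinear")
print(validate_classification(fibonacci))
print(validate_classification(broken))
```

A check like this catches values outside the taxonomy and missing fields, but it cannot tell whether a valid value is the right one for a given sequence.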

Using the visualization

What each colormap means

| Colormap | Description |
| --- | --- |
| Clusters (default) | Five hierarchical layers of Toponymy cluster names (8 / 16 / 49 / 173 / 597 unique labels, coarsest first). Cluster labels appear directly on the map and adapt to the current zoom level. |
| Math domain (LLM) | The single most-fitting mathematical area. The most populated buckets in the curated set are number theory and combinatorics. |
| Sequence type (LLM) | What each term structurally represents. |
| Growth class (LLM) | Asymptotic growth rate, classified from the leading values and the formula. Ranked-list sequences (the n-th prime, the n-th squarefree integer) are classified by how the n-th element grows, not by how the count of qualifying integers grows; the n-th prime is logarithmic-or-subpoly (it grows like n log n), not polynomial. |
| Origin era (LLM) | When the underlying concept was first studied. The model uses the sequence name, comments, and author together; classical-pre1900 wins whenever the sequence is named after a pre-1900 mathematician or refers to an ancient object, regardless of when the OEIS entry was added. |
| Author | The primary author of the OEIS entry. Bucketed to the top 10 most-prolific contributors plus "Other" for legibility on the colormap; the hover card always shows the author's full name when it could be parsed from the %A line. |
| Edit count (log10) | Base-10 logarithm of the OEIS entry's revision count. Log scale spreads out the long tail. |
| References (log10 + 1) | Base-10 logarithm of (1 + number of book and paper citations). |
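The three non-LLM colormaps are simple transforms of entry metadata. A minimal sketch follows, assuming authors arrive as a list of names (with None where the %A line could not be parsed) and that edit counts are floored at 1 before taking the logarithm; the function names, the floor, and the placeholder author names are illustrative.

```python
from collections import Counter
from math import log10


def bucket_authors(authors, top_n=10):
    """Keep the top_n most frequent authors; map everyone else to 'Other'."""
    counts = Counter(a for a in authors if a)
    top = {name for name, _ in counts.most_common(top_n)}
    return [a if a in top else "Other" for a in authors]


def edit_count_color(edit_count):
    """Base-10 log of the revision count (floored at 1 to avoid log of 0)."""
    return log10(max(edit_count, 1))


def references_color(n_references):
    """Base-10 log of (1 + number of citations)."""
    return log10(1 + n_references)


# Placeholder author names; None stands for an unparsed %A line.
authors = ["A", "A", "A", "B", "B", "C", None]
print(bucket_authors(authors, top_n=2))
print(edit_count_color(100), references_color(9))
```

Bucketing by per-author sequence count matches the "Top by per-author sequence count" note in the parameters table.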

Each sequence's full OEIS keyword list (core, nice, easy, tabl, mult, nonn, and so on) appears at the bottom of the hover card. Definitions are on the OEIS keyword wiki.

Notable parameters

Key configuration values used across the pipeline. Some appear inline in the descriptions above; the table below collects them in one place as the authoritative reference.

| Parameter | Value | Notes |
| --- | --- | --- |
| Corpus | | |
| Source | oeis/oeisdata | Sparse checkout via scripts/sync_seq.sh |
| Total sequences upstream | 394,561 | As of April 2026 |
| Curated target | 25,000 | The published map |
| Hard-exclude keywords | dead, dupe, uned, dumb | Applied at all scopes |
| Min visible terms | 8 | Sequences with fewer terms are excluded |
| Curated seed | core, nice | ~6,900 sequences |
| Embedding text budgets (per sequence, stage 04) | | |
| Name truncation | 300 chars | |
| Comment truncation | 1,500 chars | |
| Formula truncation | 500 chars | |
| Example truncation | 300 chars | |
| Values shown | 15 | Leading numeric terms kept for embedding hint and tooltip |
| LLM classification prompt budgets (stage 03) | | |
| Comment truncation | 800 chars | Tighter than the embed-side budget; prompt economy |
| Formula truncation | 400 chars | |
| Author truncation | 100 chars | |
| LLM classification (stage 03) | | |
| Model | Claude Sonnet 4.5 | claude-sonnet-4-5 |
| Batch size | 25 sequences | Per tool-use call |
| Concurrency | 30 | Async semaphore |
| Max retries | 5 | Exponential backoff on rate-limit and API errors |
| Embedding (stage 04) | | |
| Model | Cohere embed-v4.0 | input_type="clustering" |
| Dimensions | 512 | |
| Batch size | 96 | Cohere API call |
| Content keywords | excluded from embed text | So they remain an orthogonal eval signal |
| UMAP (stage 05) | | |
| n_neighbors | 15 | Local neighborhood size |
| min_dist | 0.05 | Controls cluster tightness |
| Metric | cosine | |
| random_state | 42 | Reproducibility |
| Topic labeling (stage 06) | | |
| Model | Claude Sonnet 4.5 | Toponymy AsyncAnthropicNamer |
| min_clusters | 4 | Toponymy clusterer minimum |
| Detail levels | 0.5–1.0 | |
| Object description | "OEIS integer sequences" | |
| Layers (curated) | 5 | 8 / 16 / 49 / 173 / 597 unique labels coarsest→finest |
| Visualization (stage 07) | | |
| Marker size | sqrt(edit_count) → [3, 15] px | Linear stretch; high-edit sequences are the largest dots |
| Author bucketing | Top 10 + "Other" | Top by per-author sequence count |
| Click handler | oeis.org/{id} | Opens in a new tab |
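The marker-size rule can be illustrated with a short sketch: take the square root of each edit count, then stretch the result linearly onto the [3, 15] pixel range. Stretching between the minimum and maximum over the plotted set, and the midpoint fallback when all counts are equal, are assumptions; the page only states the transform and the target range.

```python
from math import sqrt


def marker_sizes(edit_counts, lo=3.0, hi=15.0):
    """Map sqrt(edit_count) linearly onto [lo, hi] pixels."""
    roots = [sqrt(c) for c in edit_counts]
    r_min, r_max = min(roots), max(roots)
    if r_max == r_min:
        # Assumption: with no spread, use the midpoint size for every dot.
        return [(lo + hi) / 2 for _ in roots]
    scale = (hi - lo) / (r_max - r_min)
    return [lo + (r - r_min) * scale for r in roots]


# Made-up edit counts spanning a wide range.
sizes = marker_sizes([1, 100, 400])
print([round(s, 2) for s in sizes])
```

The square root compresses the long tail of heavily edited entries, so the most-edited sequences are the largest dots without dwarfing everything else.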

Limitations

This is one map of the OEIS, not the only possible one. As with any map, the map is not the territory: it's a projection shaped by specific choices about what to measure and how to render the result. To paraphrase George Box: all models are wrong, but some are useful. We think this map is useful, but it's worth understanding where it falls short.

Credits

Sequence data is from the OEIS Foundation and the thousands of contributors whose work fills the encyclopedia. Embeddings are from Cohere; LLM tagging and cluster naming use Anthropic Claude Sonnet 4.5. The pipeline depends on open-source tools from the Tutte Institute for Mathematics and Computing: UMAP for dimensionality reduction, Toponymy for hierarchical topic labeling, and DataMapPlot for interactive map rendering.