# Structural properties of Voynichese under adversarial controls: morphosyntax, production drift, and a falsification battery

**Burak Genç** (independent researcher)
**Draft manuscript — for community review (voynich.ninja / Cryptologia submission candidate)**
*Code and data pipeline: github.com/Bgenc48/voynich-data-investigation (canonical.py = single parsing source for every number herein). Version of 2026-06-12 (REPORT Parts 1–62).*

## Abstract

We report a controlled statistical investigation of the Voynich manuscript (Beinecke MS 408) text, designed so that every claim is a falsifiable test with an explicit null model, and validated by an adversarial audit round including transcription-independence, production-block holdout, and layer-ablation tests. Findings at the audited level of confidence: (1) Voynichese exhibits genuine rule-governed class-sequential structure — word-final alternations (EVA *-edy/-ey*, *-dy/-y*) shift following-word context consistently across ≥60 unrelated stems (split-half consistency r ≈ 0.55, permutation z > 4), an effect that reproduces in the independent v101 transliteration (r 0.66–0.81), survives removal of all line-edge and p/f/m tokens, generalizes across disjoint physical quires at the in-distribution ceiling for the strong pair, and replicates out-of-sample in Currier A on independently-discovered markers (r 0.63–0.65); a context-free self-citation generator matching all surface statistics scores zero, though a meaningless class-conditioned Markov process does reproduce it — so this evidences sequential rule-governance, not semantic content. (2) Section-conditioned vocabulary survives a within-scribe control. (3) Word-internal glyph order is notation-like (concordance 0.84 vs ~0.64 for natural languages, 0.98 for Roman numerals) and word-internal e/i run-length is a free variable unattested in language controls including phonemic-length Finnish. (4) The page-level mean of that variable behaves as *production drift*, not semantics: it is page-stable (split-half r = 0.90) yet varies smoothly along the physical book (adjacent-leaf r = 0.71 vs 0.50 permuted, z = 5.4). Repurposed as a production tracer, it independently confirms the bifolium-unit production hypothesis of Davis (bifolium sheet-mates are the most similar leaf pairs, z = −3.8), quantifies binding disorder by section (disorder index 0.59–0.89), and flags one undocumented within-quire discontinuity (f107→f108). (5) A 13-language code-invariant screen places the word-stream typology in the inflection-rich tier (Finnish/Greek/Hungarian closest; English worst). Several semantic hypotheses we ourselves raised — line-final totals, zodiac day-numbers, page-specific plant names (single- and multi-token), locally contextual operator values, subject-graded operator values — were each rejected by their own controls; we document these negative results. We conclude that Voynichese is a rule-governed formal system with real, transcription-independent grammar and production-order structure, but that item-level meaning is not recoverable from the text alone, and we publish the full battery as a falsification harness for future decipherment claims.

## 1. Data and reproducibility

Primary text: Zandbergen–Landini transliteration ZL 3b (May 2025, voynich.nu), IVTFF format, parsed by a single canonical parser (`canonical.py`) that emits a token table with page, quire ($Q), bifolio ($B), hand ($H, after Davis 2020), section ($I), Currier language ($L), line, and position. Headline counts: 33,122 running-text tokens (6,530 types); Currier A 10,105; Currier B 22,496 (4,590 types); 1,138 label tokens; 25-glyph EVA alphabet. Replication transliteration: GC (v101) from the same repository. Controls: Latin (Caesar), English (Gutenberg prose), plus Italian, Spanish, German, Finnish, Hungarian, Turkish, Hebrew, Arabic (Quran, dediacritized), Persian, Greek, Czech at matched token counts; a 17th-century recipe compendium (Digby) as genre control; and a fitted 11-parameter self-citation generator as the artifact control throughout. All scripts are stdlib Python; one command reproduces each table.

## 2. Results

### 2.1 Surface statistics (replication of prior art)
Conditional character entropy h2 = 1.98 bits (Currier B; cf. Bennett 1976; Lindemann & Bowern 2020), Zipf slope −1.07, narrow binomial word lengths (5.09 ± 1.77), dense edit-distance-1 vocabulary network (87% connectivity vs ~54% size-matched language controls), and strong layout effects (line-final m-forms 17×; paragraph-initial p/f words 17×; gallows line-initial 9×). A fitted generator (slot table + reuse/mutate/novel + layout rules) reproduces h2 to the third decimal and most surface statistics, but underproduces word types by ~30% — and fails every structural test below.

### 2.2 Morphosyntax (the central positive result)
For suffix pairs (s1, s2) and stems x with both forms attested ≥4×, the direction of the next-word log-odds difference replicates across disjoint random halves of the stem set: EVA -edy/-ey r = 0.554 (64 stems), -dy/-y r = 0.560 (86 stems), permutation null ≈ 0, z = 4.3–6.3; English morphology on the same pipeline: -s/∅ r = 0.498. The generator: r ≈ 0.00.
**Adversarial validation:** (i) v101 transliteration — analogous final-unit pairs (e/y, m/y, n/y) r = 0.66–0.81, z = 3.1–4.1, and these collapse under a within-line word-order shuffle (which also correctly exposes two spurious pairs the permutation null alone would pass — we recommend this control to the field). (ii) Quire-blocked holdout — train context vectors on one disjoint quire group, test on the other: -edy/-ey r = 0.553 vs in-distribution ceiling 0.547 (ratio 1.01; adversarial T-vs-M split r = 0.555); -dy/-y attenuates to ratio 0.54 and the claim is correspondingly narrowed. (iii) Layer ablation — removing line-final words, paragraph-initial words, all p/f/m words, or all line-edge words leaves r = 0.51–0.56 (z = 4.1–5.8).
Context-only clustering of the 236 most frequent Currier-B words yields eight coherent classes whose transitions carry 0.163 bits (English 0.639, Latin 0.248, generator 0.002), with morphology–syntax alignment and dedicated line-opener/line-closer classes.
**Scope of the claim (steelman).** A meaningless class-conditioned Markov process (word ~ P(word | previous word's suffix-class)), trained on Currier B, reproduces the suffix-context consistency (r 0.46–0.51 vs real 0.53–0.55); a context-free generator does not (≈0). We therefore claim only that Voynichese exhibits **rule-governed class-sequential structure stronger than context-free pseudo-text** — not that the grammar metric evidences semantic content. The structure is genuinely present in the manuscript (M1 copies real class-transition statistics); it does not, by this test, imply recoverable meaning.

### 2.3 Topic structure with scribe control
Jensen–Shannon divergence between illustration-defined sections exceeds within-section page-split baselines at ratio 1.29 (generator 1.09, English-in-same-shapes 1.10, shuffled 1.00); 96 word types are section-locked at G² > 10.83 (generator 31). Critically, the effect persists within a single hand: hand 2 herbal-vs-biological ratio 1.46. The only unexcluded confound is fine-grained production-time drift.

### 2.4 Notation-like word internals; the free run-length variable
Glyph-order concordance under a data-inferred global order: Voynich 0.840 (excess over within-word shuffle null +0.336) vs English/Latin ≈ 0.64 (+0.14), Roman numerals 0.977 (+0.475), decimal digits 0.525 (+0.02). Word-internal e/i run lengths are lexically *free*: 136/163 e-frames and 44/45 i-frames occur with multiple lengths (weighted flexibility 0.41–0.50) vs English ≤0.014 and Finnish (true phonemic length) ≤0.029; within single lines the same frame recurs with a different length 32–35% of the time (English: 0.0%).

### 2.5 The run-length variable is production drift, not semantics
Layout explains <0.5% of within-item run-length variance (rejecting a graphic line-fill account), but no locally contextual component survives controls either. The page-level mean is highly reliable (split-half r = 0.902) yet autocorrelates along the current folio order beyond section+language (r = 0.714 vs permuted 0.498, z = 5.4), with a continuous unimodal distribution across herbal pages — the signature of slowly drifting scribal register, not a graded subject attribute. We therefore retract semantic readings of this variable proposed in earlier drafts of our own work.

### 2.6 Codicological applications
Treating leaf-level drift as a production tracer: (i) bifolium sheet-mates — far apart in reading order — are the most similar leaf pairs (|Δ| = 0.105 vs reading-adjacent 0.144 vs section-random 0.185 ± 0.021; z = −3.8), independently confirming bifolium-unit production (Davis); (ii) the largest current-order discontinuity (f40→f41, z = 4.0) coincides with a documented quire+hand seam, validating the detector; (iii) one strong undocumented anomaly emerges within quire T, same hand, adjacent bifolia: f107→f108 (within-block z = 2.9) — a checkable candidate disordering; (iv) binding disorder indices (current order on a random↔optimal smoothness scale): Herbal-A 0.70 (still smoother than random, z = −3.4), Stars-B 0.59, Herbal-B 0.89 — the first quantitative bound on rebinding disturbance. We decline to publish a reconstructed page order: only one feature drifts significantly, and sorting one noisy tracer would overfit by construction.

### 2.7 Typological screen (code-invariant)
Under a word-level code, word-stream statistics pass through unchanged. Across 13 candidate languages at matched size, the closest fingerprints are Finnish, Greek, Hungarian, Czech; English is decisively last. Word-class self-chaining >1.0 occurs only in Finnish (1.36), Voynich (1.34), Hebrew (1.20), Turkish (1.00). No language matches the full fingerprint; the residuals (vocabulary growth, within-line repetition) are consistent with — but do not prove — homophonic coding conventions.

### 2.8 Negative results (each killed by its own control)
Line-final "totals" (position-specificity control); zodiac day-number labels; page-specific plant names as single tokens, bigrams, trigrams, or line-initial forms (burstiness-preserving null); locally contextual operator; subject-graded operator; flattened-table reading of lines (global-edge-gradient null). We emphasize two methodological findings: naive permutation nulls and line-shuffle nulls each passed false positives (up to z = 6.6) that the correct null exposed; Voynich work requires burstiness-preserving and within-line-shuffle controls as standard practice.

### 2.9 Second-audit round: robustness, exclusions, and the long-range information profile

A second adversarial round (REPORT Parts 53–62) added five results, each pre-registered:

**(a) Fresh-environment replication.** A clean clone on a second machine, with primary data hosts
unreachable (pinned mirrors substituted; ZL3b verified byte-identical), reproduces every headline
number exactly. Six scripts carried non-portable paths; fixed upstream.

**(b) Transcription-choice sensitivity.** The two silent parsing decisions (alternate readings →
first option; uncertain spaces → breaks) were varied over the full 2×2 grid: h2 moves ≤ 0.013
bits, the morphosyntax r stays 0.50–0.55, the BPE unit inventory is unchanged. No headline claim
is transcription-fragile.

**(c) Vowel harmony is absent, with validated positive controls.** A harmony detector that
rediscovers Turkish's back/front classes exactly (z = 71) and detects harmony through
syllable-sized units (z = 18 on syllabified Turkish) scores Voynichese at z ≈ 0 at both the
Sukhotin-vowel and BPE-unit levels, in A and B independently (period Kipchak corpus built from
the Codex Cumanicus for this test). Consequence: sign-level phonetic encodings of harmonic
languages — including the typology leaders Finnish and Hungarian (§2.7) — are excluded; harmonic
plaintexts survive only as word-level codes or vowel-suppressing spellings.

**(d) No discourse-scale information.** Excess suffix-class mutual information by token distance
(bias-corrected against 20 shuffle surrogates), with self-citation and class-Markov generator
controls: real B carries excess at all distances, but structural scrambles localize the entire
d ≥ 8 signal to page-level class composition plus neighboring-page drift — it survives
line-order-within-page shuffling and dies with page identity. Every language control, including a
recipe-genre corpus, keeps stationary class usage instead. Word-identity repetition decay tracks
the fitted self-citation generator to d = 256. This is the first direct evidence on the
meaning-vs-meaning-thin question: continuous prose underneath is strongly disfavored; meaningful
content remains viable only in line-record form.

**(e) Two mechanisms excluded; one result hardened.** Rugg's table-and-grille, minimally fitted
(60 configs), fails on type count (214 vs 4,584), entropy, grammar shape, and a comb-like MI(d)
fingerprint. Conversely the central morphosyntax result survives space-blind re-segmentation
(per-line BPE; segmenter validated on Latin/English boundary recovery): r up to +0.54 on a
tokenization that never saw a scribal space — the grammar is a property of the glyph stream.
Scribal spaces are themselves more predictable from glyph statistics (F1 0.74–0.77) than true
word boundaries are in Latin (0.43) or English (0.61): spaces behave as eye-grouped cut points.

## 3. Discussion

The audited evidence supports this statement and no stronger one: *Voynichese is a rule-governed graphic system with transcription-independent distributional morphosyntax, section-conditioned vocabulary robust to scribe identity, notation-like word internals with one free non-semantic production variable, and line-template organization — much closer to a formalized register, notation, or record system than to transcribed speech, and incompatible with simple substitution of any tested language.* Whether the system encodes recoverable meaning (e.g., a word-level nomenclator whose key is lost — historically attested practice) or instantiates a meaning-thin formal convention cannot be decided from the text alone; every internal anchor we tested that could bind tokens to referents failed under controls. The constructive contributions are the falsification battery (any future decipherment claim must reproduce ~15 measured constraints), the audited grammar, and the production-forensics tracer with one confirmed codicological hypothesis.

**Update after the second round.** The surviving readings are narrowed but the wall stands:
(i) meaning-thin generation with slowly drifting production habits now has positive long-range
evidence, not just unfalsifiability; (ii) meaningful content survives only as line-records (the
genre the imagery suggests), with harmonic-language plaintexts excluded at sign level; continuous
prose is disfavored. Any future decipherment claim must now also reproduce the absence of
discourse-scale information, the absence of vowel harmony, and the space-robustness of the
grammar (falsification battery extended accordingly).

## 4. Limitations

Exploratory breadth implies multiple-testing risk; the central results were therefore re-validated under preregistered-style holdouts (disjoint quires, independent transliteration), but community replication is the real test. The drift tracer is single-feature. Genre controls remain thin for period-correct comparanda (medieval litanies, inventories, account books). Section, hand, quire, and language are partially confounded by the manuscript's construction; we blocked where possible.

## 5. Data and code availability

All scripts, the canonical parser, the audit suite, and an interactive explainer are at the repository above; source corpora are fetched from their original maintainers (voynich.nu; Yale Beinecke digital collections) rather than redistributed. We thank the maintainers of the ZL and v101 transliterations, and prior work by Currier, Bennett, Stolfi, Montemurro & Zanette, Timm & Schinner, Lindemann & Bowern, and Davis, on which this study builds. An independent adversarial review (2026) prescribing part of the audit design is included in the repository.