Kameldip Singh Basra | February 2026
The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404-1438 CE) is identified as a 15th-century Elu-Sinhala pharmaceutical text -- a teaching manual recording a physician's spoken instructions for Ayurvedic preparations. The writing system is a bespoke abugida mapping 27 EVA characters to 14 Sinhala phonemes.
Key results:
- 94.4% of 35,916 tokens glossable in English (33,916 tokens)
- 99.7% matching Sinhala dictionary (Tier 1+2+3)
- Only 0.3% truly unknown
- Keyword-section clustering: decoded semantic profiles differ by manuscript section (Z=30.30, 0/1000 shuffles), converging with Montemurro & Zanette (2013) from independent direction
- Text-image convergence: Datura f16v illustration match (decoded `ata` label + unmistakable spiny seed capsule illustration)
- Bathing section decoded as Ayurvedic balneotherapy (snana): f78v vocabulary dominated by pharmaceutical preparation terms
- Systematic naming conventions: herbal first-word = plant name (67/112 hapax), recipe headers = preparation type, zodiac yk-sign prefix
- Biological section (f75-f84) reinterpreted as gynecological pharmaceutical formulary (97% gloss rate)
- Three new decoder rules: word-final o→a (Rule 31), cf→ch (Rule 32), cs→s (Rule 33)
- Recipe internal structure: p/f-initial headers mark recipe boundaries (χ²=820.83, 6.7× enrichment)
- Cross-tradition attestation: Bhesajjamanjusa (13th c.) matches 10/13 decoded terms
- Decoded plant labels independently match Petersen's botanical illustration IDs
- Directionality flip: EVA is RTL-optimized; decoded text is LTR-optimized — consistent with abugida encoding (Parisel 2025)
- SOV syntax consistent across all manuscript sections (Z=8.10 postpositional, Z=5.07 noun-before-verb; Greenberg 1963 Universal 4)
- Pharmaceutical collocations: 16/36 Ayurvedic pairs confirmed, H12 4.8x random decoder average
- External vocabulary validation: Z=3.4 against 150 independently-sourced pharmaceutical terms (non-circular)
- Fair cross-language test: Sinhala matches 13x more pharmaceutical terms than any other language (equalized vocabulary, 115 concepts, 6 languages)
This work follows an adversarial self-testing methodology. Every structural claim is tested against random decoders using the same decoder architecture with randomized consonant mappings. Tests that failed are documented alongside tests that passed. All scripts are reproducible from the public data.
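The shape of these random-decoder comparisons can be sketched in a few lines of Python. Everything below is a toy illustration (invented tokens, a one-word lexicon, a plain consonant shuffle), not the actual harness in `scripts/`:

```python
import random
import statistics

def z_score(observed, null_values):
    """Standard score of the H12 statistic against the random-decoder null."""
    mu = statistics.mean(null_values)
    sigma = statistics.stdev(null_values)
    return (observed - mu) / sigma

def random_decoder(consonants, rng):
    """Same architecture, but with the consonant mappings shuffled: the null model."""
    shuffled = list(consonants)
    rng.shuffle(shuffled)
    table = dict(zip(consonants, shuffled))
    return lambda word: "".join(table.get(ch, ch) for ch in word)

def dictionary_matches(decode, tokens, lexicon):
    """Example test statistic: how many decoded tokens land in the lexicon."""
    return sum(1 for t in tokens if decode(t) in lexicon)

# Toy data; the real scripts run the full EVA corpus against the Sinhala dictionary.
rng = random.Random(0)
tokens = ["daiin", "chedy", "qokaiin"]
lexicon = {"gaiin"}  # invented entry so some random decoders score a hit
null = [dictionary_matches(random_decoder("kgtdnpmylrs", rng), tokens, lexicon)
        for _ in range(200)]
```

A real test would compute the same statistic for the H12 decoding and report `z_score(h12_value, null)`.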
Evidence classification:
- Primary (non-circular, significant under dual null model): Keyword-section clustering (Z=30.30, dual-null Z=3.40), external pharma vocabulary (Z=3.4, dual-null Z=2.31), pharmaceutical collocations (Z=5.32, N=200 constrained decoders). All three survive Bonferroni correction at α/3=0.017. Fisher's method combined p = 4.5 × 10⁻¹⁰ (conservative, excluding dominant clustering test).
- Conditional corroboration (significant given lexical validity, not standalone): SOV syntax (Z=8.10 vs shuffled word order, but Z=0.85 vs random decoders). Word-order statistics alone do not discriminate H12 from constrained random decoders. SOV is meaningful as corroboration once lexical-semantic signal is established by the primary tests.
- Supporting: Cross-language discrimination (13x raw, Z=1.73), directionality flip (RTL→LTR), domain clustering, cross-modal convergences.
- Explicitly circular: Panchavidha Z=7.2 (H12-decoded terms tested against H12 output). Excluded from all significance claims.
- Stability: External pharma and keyword clustering ROBUST under all perturbations. SOV NbV FRAGILE under vocabulary pruning (drops from 63% to 44%).
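For reference, Fisher's method for combining independent p-values can be reproduced with the standard library alone. Because the statistic has 2k degrees of freedom (always even), the chi-square survival function has a closed form; this is a generic sketch, not the project's correction script:

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under the null that all k p-values are
    independent uniforms."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)
    # Survival function of chi-square with 2k (even) degrees of freedom:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

Combining a single p-value returns it unchanged, a useful sanity check.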
Honest disclosures:
- All raw dictionary Z-scores are negative (H12 matches fewer dictionary words than random decoders). The structural tests (SOV, Collocations, External Pharma) provide the stronger evidence because they test whether matched words form coherent medical text, not just raw counts.
- The cross-language pharmaceutical Z-score is 1.73 (below 2.0 for statistical significance), though the raw advantage is 13x.
- Collocation shuffled word order test failed (p=0.983) -- collocations driven by global word frequencies, not local adjacency.
- SOV syntax not significant as standalone discriminator under dual null model (constrained Z=0.85). Treated as conditional corroboration: given independently established lexical-semantic signal, decoded word order is significantly non-random (Z=8.10 vs shuffled). SOV NbV metric FRAGILE under vocabulary pruning (63.2% → 44.1% when top-10 words removed).
- Two tests failed (folio clustering, recipe sequencing) and are documented below.
We actively invite additional tests. If there is a test you believe would strengthen or weaken this hypothesis, please open an issue or submit a PR. See Contributing below.
```
├── main.tex                                # Full paper (LaTeX)
├── main.pdf                                # Compiled paper (PDF)
├── paper.md                                # Full paper (Markdown)
├── references.bib                          # Bibliography
├── VALIDATION_LOG.md                       # Chronological adversarial test log
├── data/
│   ├── voynich_eva_transcription.txt       # EVA corpus (Takahashi transcription)
│   ├── decoded_vocabulary.tsv              # 8,493-entry decoded vocabulary
│   ├── sinhala_english_dictionary_sourced.tsv  # 174 externally-sourced meaning entries
│   ├── sinhala_dictionary.txt              # 1.47M romanized Sinhala dictionary
│   ├── SOURCING_REPORT.md                  # Full audit trail for dictionary sourcing
│   ├── external_pharmaceutical_vocab.tsv   # 150 independently-sourced pharma terms (156 entries incl. category duplicates)
│   ├── crosslang_pharmaceutical_vocab.tsv  # 60 concepts x 5 languages
│   ├── equalized_pharmaceutical_vocab.tsv  # 115 concepts x 6 languages (fair test)
│   ├── hebrew_wordlist.txt                 # Hebrew wordlist
│   ├── tamil_wordlist.txt                  # Tamil wordlist
│   ├── hindi_wordlist.txt                  # Hindi wordlist
│   ├── turkish_wordlist.txt                # Turkish wordlist
│   ├── latin_wordlist.txt                  # Latin wordlist
│   └── arabic_wordlist.txt                 # Arabic wordlist
├── scripts/
│   ├── h12_decoder.py                      # Core H12 decoder
│   ├── validate_coverage.py                # 4-tier coverage validation
│   ├── validate_vowel_final.py             # Vowel-final constraint test
│   ├── validate_domain_clustering.py       # Domain clustering analysis
│   ├── validate_phonotactics.py            # Cross-language phonotactic test
│   ├── validate_external_pharma.py         # External vocabulary validation (150 pharma terms)
│   ├── crosslang_discrimination_test.py    # Cross-language discrimination (5 Indic languages)
│   ├── structural_multilang_test.py        # Fair cross-language test (6 languages, equalized vocab)
│   ├── candidate_language_test.py          # Full candidate language comparison (7 languages)
│   ├── decoder_specificity_test.py         # Decoder specificity analysis
│   ├── loanword_isolation_test.py          # Sinhala vs Pali loanword isolation
│   ├── sov_syntax_test.py                  # SOV word order validation (1000 scrambled controls)
│   ├── collocation_test.py                 # Pharmaceutical collocation validation (3 control tests)
│   ├── folio_pharma_clustering.py          # Folio-section clustering (FAILED — documented)
│   ├── recipe_sequence_test.py             # Recipe phase ordering (FAILED — documented)
│   ├── keyword_section_clustering.py       # Keyword-section clustering (Montemurro replication)
│   ├── entropy_analysis.py                 # Entropy h2 and directionality analysis
│   ├── dual_null_model_test.py             # Dual null model comparison (constrained + unconstrained)
│   ├── stability_robustness_test.py        # Stability checks (bootstrap, pruning, sections, tokenization)
│   ├── multiple_testing_correction.py      # Bonferroni / Holm-Bonferroni / BH-FDR corrections
│   ├── holdout_validation.py               # Holdout (train/test split) validation
│   ├── translate_manuscript.py             # Full manuscript translation
│   └── herbal_plant_guide.py               # Herbal folio plant identification
├── results/
│   ├── structural_evidence_summary.txt     # Master audit of all structural evidence
│   ├── structural_multilang.txt            # Fair cross-language test results
│   ├── panchavidha_validation.txt          # Panchavidha random comparison (Z=7.2)
│   ├── external_pharma_validation.txt      # External pharma validation (Z=3.4 controlled)
│   ├── crosslang_discrimination.txt        # Cross-language discrimination results
│   ├── candidate_language_validation.txt   # Full candidate language comparison
│   ├── decoder_specificity.txt             # Decoder specificity analysis
│   ├── loanword_isolation.txt              # Loanword isolation results
│   ├── sov_syntax_validation.txt           # SOV syntax validation results
│   ├── collocation_validation.txt          # Pharmaceutical collocation results
│   ├── keyword_section_clustering.txt      # Keyword-section clustering results
│   ├── entropy_directionality_analysis.txt # Entropy and directionality analysis
│   ├── dual_null_model_comparison.txt      # Dual null model comparison results
│   ├── stability_robustness.txt            # Stability and robustness check results
│   ├── multiple_testing_correction.txt     # Multiple-testing correction results
│   └── holdout_validation.txt              # Holdout validation results (3/3 PASS)
├── run_all.sh                              # One-command full rebuild + SHA256 checksums
└── translation/
    ├── voynich_translation.md              # Full English translation
    └── herbal_plant_guide.md               # Plant identification guide
```
- Python 3.7+
- NumPy (`pip install numpy`), required only for `validate_phonotactics.py`

All other scripts use only the Python standard library.
```bash
# Decode a single word
python scripts/h12_decoder.py --word daiin
# Output: gena (Sinhala: "having taken")

# Decode the full corpus
python scripts/h12_decoder.py --input data/voynich_eva_transcription.txt --summary

# Run core validation scripts
python scripts/validate_coverage.py           # Tier 1: 94.4%, Total known: 99.7%
python scripts/validate_vowel_final.py        # 99.73% vowel-final
python scripts/validate_domain_clustering.py  # 46% medical (9x over random baseline)
python scripts/validate_phonotactics.py       # Sinhala ranks #2-3 across measures

# Run adversarial random-decoder comparisons
python scripts/validate_external_pharma.py    # Z=3.4 controlled (150 external pharma terms)
python scripts/structural_multilang_test.py   # Fair cross-language: Sinhala 13x (Z=1.73)
python scripts/candidate_language_test.py     # Sinhala #1 by raw coverage (47.3%)
python scripts/decoder_specificity_test.py    # H12 specificity analysis
python scripts/loanword_isolation_test.py     # Sinhala vs Pali discrimination

# Run literature-derived analyses
python scripts/sov_syntax_test.py             # SOV word order (Z=8.10 postposition, Z=5.07 NbV)
python scripts/collocation_test.py            # Pharmaceutical collocations (16/36, H12 4.8x random)
python scripts/keyword_section_clustering.py  # Keyword-section clustering (Z=30.30)
python scripts/entropy_analysis.py            # Entropy h2 and directionality

# Run methodological robustness checks
python scripts/dual_null_model_test.py        # Dual null model (constrained + unconstrained, 200 trials each)
python scripts/stability_robustness_test.py   # Stability checks (bootstrap, pruning, sections, tokenization)
python scripts/multiple_testing_correction.py # Multiple-testing correction (Bonferroni, Holm, BH-FDR)
python scripts/holdout_validation.py          # Holdout validation (3/3 PASS, odd/even folio split)

# Or run everything at once with checksums:
./run_all.sh                                  # Full rebuild + SHA256 checksums
```

The H12 decoder maps EVA (Extensible Voynich Alphabet) characters to Sinhala phonemes via abugida rules. The system uses a 14-phoneme inventory corresponding to the spoken Elu dialect of medieval Sinhala.
Key mappings include:
- `sh` maps to m
- `o` maps to u
- `d` (onset) maps to g
- `k` (medial) maps to g
- `ch` + consonant triggers devoicing
- `ct` maps to th (aspirated dental)
- `ck` maps to kh (aspirated velar)
- `cp` maps to ph (aspirated labial)
- `q`/`h` are silent (null carriers)
The decoder applies positional rules (onset vs. medial vs. coda) and abugida vowel-handling conventions to produce phonetic Sinhala output from EVA input.
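Purely as an illustration, the positional rules above might look like the sketch below. Vowel handling, the `ch`-devoicing rule, and most of the 27-character inventory are omitted, so its output differs from the full `h12_decoder.py`; only the example mappings listed above are encoded:

```python
# Illustrative subset of the mapping rules; NOT the full H12 rule set.
DIGRAPHS = {"sh": "m", "ct": "th", "ck": "kh", "cp": "ph"}
SILENT = {"q", "h"}  # null carriers

def decode_word(eva):
    out, i = [], 0
    while i < len(eva):
        if eva[i:i + 2] in DIGRAPHS:              # digraphs take precedence
            out.append(DIGRAPHS[eva[i:i + 2]])
            i += 2
            continue
        ch = eva[i]
        if ch in SILENT:
            pass                                  # dropped: null carrier
        elif ch == "o":
            out.append("u")
        elif ch == "d" and i == 0:                # onset rule: d -> g
            out.append("g")
        elif ch == "k" and 0 < i < len(eva) - 1:  # medial rule: k -> g
            out.append("g")
        else:
            out.append(ch)                        # pass through unmodeled chars
        i += 1
    return "".join(out)
```

With vowel rules omitted, `decode_word("daiin")` yields `gaiin` rather than the full decoder's `gena`.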
The decoder's meaning dictionary is fully externally sourced -- no AI-training-derived meanings are used.
- Primary source: `data/sinhala_english_dictionary_sourced.tsv` -- 174 entries verified against published dictionaries (OSDB, Clough 1892, Monier-Williams, Digital Pali Dictionary, PTS Pali-English Dictionary)
- Secondary source: `data/decoded_vocabulary.tsv` -- 8,493 entries from compound splits and edit-distance-1 matching against the Sinhala dictionary
- Full audit trail: `data/SOURCING_REPORT.md` -- documents the verification status and external source for every entry
Each entry in the sourced dictionary carries a verification status (VERIFIED, VERIFIED-VERB, VERIFIED-COMPOUND, VERIFIED-ETYM, PARTIAL, or HYPOTHESIZED) and a citation to the specific dictionary and line/entry that confirms it.
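The verification tiers above can be filtered mechanically. A minimal sketch, assuming a tab-separated file with a `status` column (check the actual header of `data/sinhala_english_dictionary_sourced.tsv` before relying on these names):

```python
import csv

# Hypothetical column name ("status"); verify against the real TSV header.
VERIFIED_STATUSES = {"VERIFIED", "VERIFIED-VERB", "VERIFIED-COMPOUND",
                     "VERIFIED-ETYM"}

def is_verified(row):
    """True for entries in any fully verified tier; PARTIAL and
    HYPOTHESIZED entries are excluded."""
    return row.get("status", "").strip() in VERIFIED_STATUSES

def load_verified_entries(path):
    """Read the sourced TSV and keep only fully verified entries."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f, delimiter="\t")
                if is_verified(row)]
```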
The manuscript encodes spoken language -- a phonetic transcription of medieval speech -- not written language in cipher. This explains why:
- The 14-phoneme inventory matches pre-12th-century spoken Elu (no /b/, /v/, /f/)
- N-gram frequencies match spoken Sinhala (#1 when spoken-weighted)
- Edit-distance-1 matches are pronunciation variants, not errors
- Compounds run together as pronounced (reflecting continuous speech)
- Word-length distribution matches agglutinative spoken language
This "spoken, not written" framing resolves decades of failed cryptographic attacks: the manuscript was never a cipher, and classical decryption methods were searching for the wrong kind of structure.
Every test below can be reproduced by running the listed script. All Z-scores are computed against random decoders using the same architecture with randomized consonant mappings.
| Test | Script | Result | Significance |
|---|---|---|---|
| Panchavidha Kashaya Kalpana | `structural_multilang_test.py` | 6 decoded dosage form terms, 1,560 tokens, 1/200 random match | Z = 7.2 (CIRCULAR — see disclosures) |
| SOV syntax (postpositional) | `sov_syntax_test.py` | 77.1% postposition-after-noun | Z = 8.10 |
| SOV syntax (noun-before-verb) | `sov_syntax_test.py` | 66.2% noun-before-verb | Z = 5.07 |
| External pharma vocabulary | `validate_external_pharma.py` | 7,130 tokens from 150 published terms, 0/38 beat H12 (controlled) | Z = 3.4 |
| Pharmaceutical collocations | `collocation_test.py` | 16/36 Ayurvedic pairs, H12=16 vs random avg=3.3 | H12 4.8x random |
| Fair cross-language pharma | `structural_multilang_test.py` | Sinhala 5,021 tokens, 13x next-best (equalized 115-concept vocab) | Z = 1.73 |
| Candidate language comparison | `candidate_language_test.py` | Sinhala #1 by raw coverage (47.3%, 16,977 tokens); all Z-scores negative | Best raw coverage |
| Loanword isolation | `loanword_isolation_test.py` | Sinhala-only Z=1.39, Pali at 25% of Sinhala (consistent with loanwords) | Weak pass |
| Keyword-section clustering | `keyword_section_clustering.py` | Semantic profiles differ by section (chi2=2015, 1000 shuffles) | Z = 30.30 |
| Test | Script | Result |
|---|---|---|
| Dual null model | `dual_null_model_test.py` | 3/5 tests significant under both constrained and unconstrained nulls. Pharma Z=2.31/73.81, Collocations Z=5.32/inf, Clustering Z=3.40/116.95. SOV not significant under random decoders (Z=0.85/1.38). |
| Stability checks | `stability_robustness_test.py` | 2/3 claims ROBUST (external pharma, keyword clustering). SOV FRAGILE: NbV drops from 63% to 44% under top-10 vocab pruning. |
| Multiple-testing correction | `multiple_testing_correction.py` | 3/3 primary tests survive Bonferroni (α/3=0.017). SOV reclassified as conditional corroboration (fails dual-null standalone). Even correcting all 8 quantitative tests: 5/8 Bonferroni, 6/8 Holm-Bonferroni, 7/8 BH-FDR. |
| Holdout validation | `holdout_validation.py` | 3/3 holdout tests PASS (odd/even folio split). Pharma vocab Z=19.7 (14.8× random), collocations Z=21.0 (154/160 transfer), section prediction Z=4.3 (43.5% vs 14.2% chance). Fisher's method combined p = 4.5 × 10⁻¹⁰. |
| Test | Script | Result |
|---|---|---|
| Coverage | `validate_coverage.py` | 94.4% glossed, 99.7% total known, 0.3% unknown |
| Vowel-final constraint | `validate_vowel_final.py` | 99.73% vowel-final (Elu abugida) |
| Domain clustering | `validate_domain_clustering.py` | 46.1% medical vocabulary (9x over ~5% random baseline) |
| Phonotactics | `validate_phonotactics.py` | Sinhala #2-3 across CV-pattern measures |
| Decoder specificity | `decoder_specificity_test.py` | Sinhala #1 by raw coverage (+10,201 tokens); all delta Z-scores negative (-0.82) |
| Directionality flip | `entropy_analysis.py` | EVA RTL-optimized (PP ratio 0.899), decoded LTR-optimized (1.804) |
| Entropy h2 | `entropy_analysis.py` | EVA h2=2.358, decoded h2=2.339 (delta -0.019, not closer to natural) |
| Finding | Details |
|---|---|
| All dictionary Z-scores negative | H12 matches fewer dictionary words than random decoders for ALL languages. Sinhala delta Z = -0.82 (Hebrew is least negative at -0.73 but matches 0 tokens). Random decoders explore more of the dictionary space. |
| Cross-language Z below 2.0 | Fair test Z=1.73. Raw advantage (13x) is clear but not statistically significant by this metric alone. |
| Generic CV hypothesis | Cannot be fully rejected by dictionary matching alone. The structural tests (Panchavidha, SOV, Collocations) are the discriminating evidence. |
| Decoded entropy unchanged | H12 decoded h2=2.339 vs EVA h2=2.358 (delta -0.019). Decoding does not move entropy closer to natural language (~3.3 bits). |
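The h2 figures above are second-order (conditional) character entropies. A minimal sketch of the metric, which may differ in detail (tokenization, word boundaries, smoothing) from what `entropy_analysis.py` computes:

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy H(X_n | X_{n-1}) in bits per character,
    estimated as H(bigram) - H(unigram). Plug-in counts give a
    finite-sample approximation (bigram and unigram totals differ by one)."""
    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    return entropy(bigrams) - entropy(unigrams)
```

Fully predictable sequences give h2 near 0; the ~3.3-bit figure cited above is the natural-language benchmark.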
| Test | Script | Result | Why It Failed |
|---|---|---|---|
| Folio-section clustering | `folio_pharma_clustering.py` | Z = 0.2 | Entire manuscript is medical — no non-medical control sections |
| Recipe phase ordering | `recipe_sequence_test.py` | All 3 metrics failed | Matched vocabulary too generic to reveal recipe phases |
These involve two independent modalities (decoded text + illustrations) converging on the same conclusion. No random decoder can match illustrations.
- Petersen botanical IDs: 3 plants independently identified by botanist and by H12 (tamala, kamala, tambula)
- Rajas on women's folios: decoded "ra" (menstrual/female principle) concentrates in sections illustrating women
- Recipe-illustration coherence: pharmaceutical sections with vessel illustrations decode as processing steps; herbal sections with plant illustrations decode as plant parts
We welcome adversarial testing. If you can think of a test that would falsify or strengthen this hypothesis, we want to hear it.
Ways to contribute:
- Open an issue describing a test you'd like to see. We will implement and run it, and publish the results regardless of outcome (as we have with our failed tests).
- Submit a PR with a new validation script. Follow the pattern of existing scripts: decode the corpus with H12, run the same test on 200 random decoders, report the Z-score.
- Attempt hostile replication: clone the repo, run the scripts, report what you find.
- Challenge a specific claim: if any number in this README doesn't match what you get when running the script, file a bug.
Tests we would particularly value from the community:
- Independent Sinhala/Elu linguistic assessment of the decoded text
- Additional cross-language discrimination with larger vocabularies
- Statistical tests for hidden structure that we haven't thought of
- Comparison against other proposed Voynich decipherments using the same random-decoder methodology
The full audit trail of all tests (passed and failed) is in `results/structural_evidence_summary.txt` and `VALIDATION_LOG.md`.
```bibtex
@article{Basra2026,
  author = {Basra, Kameldip Singh},
  title  = {The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala},
  year   = {2026},
  doi    = {10.5281/zenodo.18644772},
  url    = {https://doi.org/10.5281/zenodo.18644772}
}
```

This work is licensed under CC BY-NC 4.0 -- free for personal and academic research use. For commercial licensing, contact kameldipbasra@gmail.com. See LICENSE file.