Kameldip Singh Basra | February 2026
The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404-1438 CE) is identified as a 15th-century Elu-Sinhala pharmaceutical text -- a teaching manual recording a physician's spoken instructions for Ayurvedic preparations. The writing system is a bespoke abugida mapping 27 EVA characters to 14 Sinhala phonemes.
Key results:
- 94.4% of 35,916 tokens glossable in English (33,916 tokens)
- 99.7% matching Sinhala dictionary (Tier 1+2+3)
- Only 0.3% truly unknown
- Keyword-section clustering: decoded semantic profiles differ by manuscript section (Z=30.30, 0/1000 shuffles), converging with Montemurro & Zanette (2013) from independent direction
- Text-image convergence: Datura f16v illustration match (decoded `ata` label + unmistakable spiny seed capsule illustration)
- Bathing section decoded as Ayurvedic balneotherapy (snana): f78v vocabulary dominated by pharmaceutical preparation terms
- Systematic naming conventions: herbal first-word = plant name (67/112 hapax), recipe headers = preparation type, zodiac yk-sign prefix
- Biological section (f75-f84) reinterpreted as gynecological pharmaceutical formulary (97% gloss rate)
- Three new decoder rules: word-final o→a (Rule 31), cf→ch (Rule 32), cs→s (Rule 33)
- Recipe internal structure: p/f-initial headers mark recipe boundaries (χ²=820.83, 6.7× enrichment)
- Cross-tradition attestation: Bhesajjamanjusa (13th c.) matches 10/13 decoded terms
- Decoded plant labels independently match Petersen's botanical illustration IDs
- Directionality flip: EVA is RTL-optimized; decoded text is LTR-optimized — consistent with abugida encoding (Parisel 2025)
- SOV syntax consistent across all manuscript sections (Z=8.10 postpositional, Z=5.07 noun-before-verb; Greenberg 1963 Universal 4)
- Pharmaceutical collocations: 16/36 Ayurvedic pairs confirmed, H12 4.8x random decoder average
- External vocabulary validation: Z=3.4 against 150 independently-sourced pharmaceutical terms (non-circular)
- Fair cross-language test: Sinhala matches 13x more pharmaceutical terms than any other language (equalized vocabulary, 115 concepts, 6 languages)
This work follows an adversarial self-testing methodology. Every structural claim is tested against random decoders using the same decoder architecture with randomized consonant mappings. Tests that failed are documented alongside tests that passed. All scripts are reproducible from the public data.
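The shape of these random-decoder comparisons can be sketched in a few lines of Python. Everything below is a toy illustration (invented tokens, a one-word lexicon, a plain consonant shuffle), not the actual harness in `scripts/`:

```python
import random
import statistics

def z_score(observed, null_values):
    """Standard score of the H12 statistic against the random-decoder null."""
    mu = statistics.mean(null_values)
    sigma = statistics.stdev(null_values)
    return (observed - mu) / sigma

def random_decoder(consonants, rng):
    """Same architecture, but with the consonant mappings shuffled: the null model."""
    shuffled = list(consonants)
    rng.shuffle(shuffled)
    table = dict(zip(consonants, shuffled))
    return lambda word: "".join(table.get(ch, ch) for ch in word)

def dictionary_matches(decode, tokens, lexicon):
    """Example test statistic: how many decoded tokens land in the lexicon."""
    return sum(1 for t in tokens if decode(t) in lexicon)

# Toy data; the real scripts run the full EVA corpus against the Sinhala dictionary.
rng = random.Random(0)
tokens = ["daiin", "chedy", "qokaiin"]
lexicon = {"gaiin"}  # invented entry so some random decoders score a hit
null = [dictionary_matches(random_decoder("kgtdnpmylrs", rng), tokens, lexicon)
        for _ in range(200)]
```

A real test would compute the same statistic for the H12 decoding and report `z_score(h12_value, null)`.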
Evidence classification:
- Primary (non-circular, significant under dual null model): Keyword-section clustering (Z=30.30, dual-null Z=3.40), external pharma vocabulary (Z=3.4, dual-null Z=2.31), pharmaceutical collocations (Z=5.32, N=200 constrained decoders). All three survive Bonferroni correction at α/3=0.017. Fisher's method combined p = 4.5 × 10⁻¹⁰ (conservative, excluding dominant clustering test).
- Conditional corroboration (significant given lexical validity, not standalone): SOV syntax (Z=8.10 vs shuffled word order, but Z=0.85 vs random decoders). Word-order statistics alone do not discriminate H12 from constrained random decoders. SOV is meaningful as corroboration once lexical-semantic signal is established by the primary tests.
- Supporting: Cross-language discrimination (13x raw, Z=1.73), directionality flip (RTL→LTR), domain clustering, cross-modal convergences.
- Explicitly circular: Panchavidha Z=7.2 (H12-decoded terms tested against H12 output). Excluded from all significance claims.
- Stability: External pharma and keyword clustering ROBUST under all perturbations. SOV NbV FRAGILE under vocabulary pruning (drops from 63% to 44%).
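For reference, Fisher's method for combining independent p-values can be reproduced with the standard library alone. Because the statistic has 2k degrees of freedom (always even), the chi-square survival function has a closed form; this is a generic sketch, not the project's correction script:

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under the null that all k p-values are
    independent uniforms."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)
    # Survival function of chi-square with 2k (even) degrees of freedom:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

Combining a single p-value returns it unchanged, a useful sanity check.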
Honest disclosures:
- All raw dictionary Z-scores are negative (H12 matches fewer dictionary words than random decoders). The structural tests (SOV, Collocations, External Pharma) provide the stronger evidence because they test whether matched words form coherent medical text, not just raw counts.
- The cross-language pharmaceutical Z-score is 1.73 (below 2.0 for statistical significance), though the raw advantage is 13x.
- Collocation shuffled word order test failed (p=0.983) -- collocations driven by global word frequencies, not local adjacency.
- SOV syntax not significant as standalone discriminator under dual null model (constrained Z=0.85). Treated as conditional corroboration: given independently established lexical-semantic signal, decoded word order is significantly non-random (Z=8.10 vs shuffled). SOV NbV metric FRAGILE under vocabulary pruning (63.2% → 44.1% when top-10 words removed).
- Two tests failed (folio clustering, recipe sequencing) and are documented below.
We actively invite additional tests. If there is a test you believe would strengthen or weaken this hypothesis, please open an issue or submit a PR. See Contributing below.
```
├── main.tex                                # Full paper (LaTeX)
├── main.pdf                                # Compiled paper (PDF)
├── paper.md                                # Full paper (Markdown)
├── references.bib                          # Bibliography
├── VALIDATION_LOG.md                       # Chronological adversarial test log
├── data/
│   ├── voynich_eva_transcription.txt       # EVA corpus (Takahashi transcription)
│   ├── decoded_vocabulary.tsv              # 8,493-entry decoded vocabulary
│   ├── sinhala_english_dictionary_sourced.tsv  # 174 externally-sourced meaning entries
│   ├── sinhala_dictionary.txt              # 1.47M romanized Sinhala dictionary
│   ├── SOURCING_REPORT.md                  # Full audit trail for dictionary sourcing
│   ├── external_pharmaceutical_vocab.tsv   # 150 independently-sourced pharma terms (156 entries incl. category duplicates)
│   ├── crosslang_pharmaceutical_vocab.tsv  # 60 concepts x 5 languages
│   ├── equalized_pharmaceutical_vocab.tsv  # 115 concepts x 6 languages (fair test)
│   ├── hebrew_wordlist.txt                 # Hebrew wordlist
│   ├── tamil_wordlist.txt                  # Tamil wordlist
│   ├── hindi_wordlist.txt                  # Hindi wordlist
│   ├── turkish_wordlist.txt                # Turkish wordlist
│   ├── latin_wordlist.txt                  # Latin wordlist
│   └── arabic_wordlist.txt                 # Arabic wordlist
├── scripts/
│   ├── h12_decoder.py                      # Core H12 decoder
│   ├── validate_coverage.py                # 4-tier coverage validation
│   ├── validate_vowel_final.py             # Vowel-final constraint test
│   ├── validate_domain_clustering.py       # Domain clustering analysis
│   ├── validate_phonotactics.py            # Cross-language phonotactic test
│   ├── validate_external_pharma.py         # External vocabulary validation (150 pharma terms)
│   ├── crosslang_discrimination_test.py    # Cross-language discrimination (5 Indic languages)
│   ├── structural_multilang_test.py        # Fair cross-language test (6 languages, equalized vocab)
│   ├── candidate_language_test.py          # Full candidate language comparison (7 languages)
│   ├── decoder_specificity_test.py         # Decoder specificity analysis
│   ├── loanword_isolation_test.py          # Sinhala vs Pali loanword isolation
│   ├── sov_syntax_test.py                  # SOV word order validation (1000 scrambled controls)
│   ├── collocation_test.py                 # Pharmaceutical collocation validation (3 control tests)
│   ├── folio_pharma_clustering.py          # Folio-section clustering (FAILED — documented)
│   ├── recipe_sequence_test.py             # Recipe phase ordering (FAILED — documented)
│   ├── keyword_section_clustering.py       # Keyword-section clustering (Montemurro replication)
│   ├── entropy_analysis.py                 # Entropy h2 and directionality analysis
│   ├── dual_null_model_test.py             # Dual null model comparison (constrained + unconstrained)
│   ├── stability_robustness_test.py        # Stability checks (bootstrap, pruning, sections, tokenization)
│   ├── multiple_testing_correction.py      # Bonferroni / Holm-Bonferroni / BH-FDR corrections
│   ├── holdout_validation.py               # Holdout (train/test split) validation
│   ├── translate_manuscript.py             # Full manuscript translation
│   └── herbal_plant_guide.py               # Herbal folio plant identification
├── results/
│   ├── structural_evidence_summary.txt     # Master audit of all structural evidence
│   ├── structural_multilang.txt            # Fair cross-language test results
│   ├── panchavidha_validation.txt          # Panchavidha random comparison (Z=7.2)
│   ├── external_pharma_validation.txt      # External pharma validation (Z=3.4 controlled)
│   ├── crosslang_discrimination.txt        # Cross-language discrimination results
│   ├── candidate_language_validation.txt   # Full candidate language comparison
│   ├── decoder_specificity.txt             # Decoder specificity analysis
│   ├── loanword_isolation.txt              # Loanword isolation results
│   ├── sov_syntax_validation.txt           # SOV syntax validation results
│   ├── collocation_validation.txt          # Pharmaceutical collocation results
│   ├── keyword_section_clustering.txt      # Keyword-section clustering results
│   ├── entropy_directionality_analysis.txt # Entropy and directionality analysis
│   ├── dual_null_model_comparison.txt      # Dual null model comparison results
│   ├── stability_robustness.txt            # Stability and robustness check results
│   ├── multiple_testing_correction.txt     # Multiple-testing correction results
│   └── holdout_validation.txt              # Holdout validation results (3/3 PASS)
├── run_all.sh                              # One-command full rebuild + SHA256 checksums
└── translation/
    ├── voynich_translation.md              # Full English translation
    └── herbal_plant_guide.md               # Plant identification guide
```
- Python 3.7+
- NumPy (`pip install numpy`), required only for `validate_phonotactics.py`

All other scripts use only the Python standard library.
```bash
# Decode a single word
python scripts/h12_decoder.py --word daiin
# Output: gena (Sinhala: "having taken")

# Decode the full corpus
python scripts/h12_decoder.py --input data/voynich_eva_transcription.txt --summary

# Run core validation scripts
python scripts/validate_coverage.py           # Tier 1: 94.4%, Total known: 99.7%
python scripts/validate_vowel_final.py        # 99.73% vowel-final
python scripts/validate_domain_clustering.py  # 46% medical (9x over random baseline)
python scripts/validate_phonotactics.py       # Sinhala ranks #2-3 across measures

# Run adversarial random-decoder comparisons
python scripts/validate_external_pharma.py    # Z=3.4 controlled (150 external pharma terms)
python scripts/structural_multilang_test.py   # Fair cross-language: Sinhala 13x (Z=1.73)
python scripts/candidate_language_test.py     # Sinhala #1 by raw coverage (47.3%)
python scripts/decoder_specificity_test.py    # H12 specificity analysis
python scripts/loanword_isolation_test.py     # Sinhala vs Pali discrimination

# Run literature-derived analyses
python scripts/sov_syntax_test.py             # SOV word order (Z=8.10 postposition, Z=5.07 NbV)
python scripts/collocation_test.py            # Pharmaceutical collocations (16/36, H12 4.8x random)
python scripts/keyword_section_clustering.py  # Keyword-section clustering (Z=30.30)
python scripts/entropy_analysis.py            # Entropy h2 and directionality

# Run methodological robustness checks
python scripts/dual_null_model_test.py        # Dual null model (constrained + unconstrained, 200 trials each)
python scripts/stability_robustness_test.py   # Stability checks (bootstrap, pruning, sections, tokenization)
python scripts/multiple_testing_correction.py # Multiple-testing correction (Bonferroni, Holm, BH-FDR)
python scripts/holdout_validation.py          # Holdout validation (3/3 PASS, odd/even folio split)

# Or run everything at once with checksums:
./run_all.sh                                  # Full rebuild + SHA256 checksums
```

The H12 decoder maps EVA (Extensible Voynich Alphabet) characters to Sinhala phonemes via abugida rules. The system uses a 14-phoneme inventory corresponding to the spoken Elu dialect of medieval Sinhala.
Key mappings include:
- `sh` maps to m
- `o` maps to u
- `d` (onset) maps to g
- `k` (medial) maps to g
- `ch` + consonant triggers devoicing
- `ct` maps to th (aspirated dental)
- `ck` maps to kh (aspirated velar)
- `cp` maps to ph (aspirated labial)
- `q`/`h` are silent (null carriers)
The decoder applies positional rules (onset vs. medial vs. coda) and abugida vowel-handling conventions to produce phonetic Sinhala output from EVA input.
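Purely as an illustration, the positional rules above might look like the sketch below. Vowel handling, the `ch`-devoicing rule, and most of the 27-character inventory are omitted, so its output differs from the full `h12_decoder.py`; only the example mappings listed above are encoded:

```python
# Illustrative subset of the mapping rules; NOT the full H12 rule set.
DIGRAPHS = {"sh": "m", "ct": "th", "ck": "kh", "cp": "ph"}
SILENT = {"q", "h"}  # null carriers

def decode_word(eva):
    out, i = [], 0
    while i < len(eva):
        if eva[i:i + 2] in DIGRAPHS:              # digraphs take precedence
            out.append(DIGRAPHS[eva[i:i + 2]])
            i += 2
            continue
        ch = eva[i]
        if ch in SILENT:
            pass                                  # dropped: null carrier
        elif ch == "o":
            out.append("u")
        elif ch == "d" and i == 0:                # onset rule: d -> g
            out.append("g")
        elif ch == "k" and 0 < i < len(eva) - 1:  # medial rule: k -> g
            out.append("g")
        else:
            out.append(ch)                        # pass through unmodeled chars
        i += 1
    return "".join(out)
```

With vowel rules omitted, `decode_word("daiin")` yields `gaiin` rather than the full decoder's `gena`.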
The decoder's meaning dictionary is fully externally sourced -- no AI-training-derived meanings are used.
- Primary source: `data/sinhala_english_dictionary_sourced.tsv` -- 174 entries verified against published dictionaries (OSDB, Clough 1892, Monier-Williams, Digital Pali Dictionary, PTS Pali-English Dictionary)
- Secondary source: `data/decoded_vocabulary.tsv` -- 8,493 entries from compound splits and edit-distance-1 matching against the Sinhala dictionary
- Full audit trail: `data/SOURCING_REPORT.md` -- documents the verification status and external source for every entry
Each entry in the sourced dictionary carries a verification status (VERIFIED, VERIFIED-VERB, VERIFIED-COMPOUND, VERIFIED-ETYM, PARTIAL, or HYPOTHESIZED) and a citation to the specific dictionary and line/entry that confirms it.
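The verification tiers above can be filtered mechanically. A minimal sketch, assuming a tab-separated file with a `status` column (check the actual header of `data/sinhala_english_dictionary_sourced.tsv` before relying on these names):

```python
import csv

# Hypothetical column name ("status"); verify against the real TSV header.
VERIFIED_STATUSES = {"VERIFIED", "VERIFIED-VERB", "VERIFIED-COMPOUND",
                     "VERIFIED-ETYM"}

def is_verified(row):
    """True for entries in any fully verified tier; PARTIAL and
    HYPOTHESIZED entries are excluded."""
    return row.get("status", "").strip() in VERIFIED_STATUSES

def load_verified_entries(path):
    """Read the sourced TSV and keep only fully verified entries."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f, delimiter="\t")
                if is_verified(row)]
```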
The manuscript encodes spoken language -- a phonetic transcription of medieval speech -- not written language in cipher. This explains why:
- The 14-phoneme inventory matches pre-12th-century spoken Elu (no /b/, /v/, /f/)
- N-gram frequencies match spoken Sinhala (#1 when spoken-weighted)
- Edit-distance-1 matches are pronunciation variants, not errors
- Compounds run together as pronounced (reflecting continuous speech)
- Word-length distribution matches agglutinative spoken language
This "spoken, not written" framing resolves decades of failed cryptographic attacks: the manuscript was never a cipher, and classical decryption methods were searching for the wrong kind of structure.
Every test below can be reproduced by running the listed script. All Z-scores are computed against random decoders using the same architecture with randomized consonant mappings.
| Test | Script | Result | Significance |
|---|---|---|---|
| Panchavidha Kashaya Kalpana | `structural_multilang_test.py` | 6 decoded dosage form terms, 1,560 tokens, 1/200 random match | Z = 7.2 (CIRCULAR — see disclosures) |
| SOV syntax (postpositional) | `sov_syntax_test.py` | 77.1% postposition-after-noun | Z = 8.10 |
| SOV syntax (noun-before-verb) | `sov_syntax_test.py` | 66.2% noun-before-verb | Z = 5.07 |
| External pharma vocabulary | `validate_external_pharma.py` | 7,130 tokens from 150 published terms, 0/38 beat H12 (controlled) | Z = 3.4 |
| Pharmaceutical collocations | `collocation_test.py` | 16/36 Ayurvedic pairs, H12=16 vs random avg=3.3 | H12 4.8x random |
| Fair cross-language pharma | `structural_multilang_test.py` | Sinhala 5,021 tokens, 13x next-best (equalized 115-concept vocab) | Z = 1.73 |
| Candidate language comparison | `candidate_language_test.py` | Sinhala #1 by raw coverage (47.3%, 16,977 tokens); all Z-scores negative | Best raw coverage |
| Loanword isolation | `loanword_isolation_test.py` | Sinhala-only Z=1.39, Pali at 25% of Sinhala (consistent with loanwords) | Weak pass |
| Keyword-section clustering | `keyword_section_clustering.py` | Semantic profiles differ by section (chi2=2015, 1000 shuffles) | Z = 30.30 |
| Test | Script | Result |
|---|---|---|
| Dual null model | `dual_null_model_test.py` | 3/5 tests significant under both constrained and unconstrained nulls. Pharma Z=2.31/73.81, Collocations Z=5.32/inf, Clustering Z=3.40/116.95. SOV not significant under random decoders (Z=0.85/1.38). |
| Stability checks | `stability_robustness_test.py` | 2/3 claims ROBUST (external pharma, keyword clustering). SOV FRAGILE: NbV drops from 63% to 44% under top-10 vocab pruning. |
| Multiple-testing correction | `multiple_testing_correction.py` | 3/3 primary tests survive Bonferroni (α/3=0.017). SOV reclassified as conditional corroboration (fails dual-null standalone). Even correcting all 8 quantitative tests: 5/8 Bonferroni, 6/8 Holm-Bonferroni, 7/8 BH-FDR. |
| Holdout validation | `holdout_validation.py` | 3/3 holdout tests PASS (odd/even folio split). Pharma vocab Z=19.7 (14.8× random), collocations Z=21.0 (154/160 transfer), section prediction Z=4.3 (43.5% vs 14.2% chance). Fisher's method combined p = 4.5 × 10⁻¹⁰. |
| Test | Script | Result |
|---|---|---|
| Coverage | `validate_coverage.py` | 94.4% glossed, 99.7% total known, 0.3% unknown |
| Vowel-final constraint | `validate_vowel_final.py` | 99.73% vowel-final (Elu abugida) |
| Domain clustering | `validate_domain_clustering.py` | 46.1% medical vocabulary (9x over ~5% random baseline) |
| Phonotactics | `validate_phonotactics.py` | Sinhala #2-3 across CV-pattern measures |
| Decoder specificity | `decoder_specificity_test.py` | Sinhala #1 by raw coverage (+10,201 tokens); all delta Z-scores negative (-0.82) |
| Directionality flip | `entropy_analysis.py` | EVA RTL-optimized (PP ratio 0.899), decoded LTR-optimized (1.804) |
| Entropy h2 | `entropy_analysis.py` | EVA h2=2.358, decoded h2=2.339 (delta -0.019, not closer to natural) |
| Finding | Details |
|---|---|
| All dictionary Z-scores negative | H12 matches fewer dictionary words than random decoders for ALL languages. Sinhala delta Z = -0.82 (Hebrew is least negative at -0.73 but matches 0 tokens). Random decoders explore more of the dictionary space. |
| Cross-language Z below 2.0 | Fair test Z=1.73. Raw advantage (13x) is clear but not statistically significant by this metric alone. |
| Generic CV hypothesis | Cannot be fully rejected by dictionary matching alone. The structural tests (Panchavidha, SOV, Collocations) are the discriminating evidence. |
| Decoded entropy unchanged | H12 decoded h2=2.339 vs EVA h2=2.358 (delta -0.019). Decoding does not move entropy closer to natural language (~3.3 bits). |
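The h2 figures above are second-order (conditional) character entropies. A minimal sketch of the metric, which may differ in detail (tokenization, word boundaries, smoothing) from what `entropy_analysis.py` computes:

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy H(X_n | X_{n-1}) in bits per character,
    estimated as H(bigram) - H(unigram). Plug-in counts give a
    finite-sample approximation (bigram and unigram totals differ by one)."""
    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    return entropy(bigrams) - entropy(unigrams)
```

Fully predictable sequences give h2 near 0; the ~3.3-bit figure cited above is the natural-language benchmark.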
| Test | Script | Result | Why It Failed |
|---|---|---|---|
| Folio-section clustering | `folio_pharma_clustering.py` | Z = 0.2 | Entire manuscript is medical — no non-medical control sections |
| Recipe phase ordering | `recipe_sequence_test.py` | All 3 metrics failed | Matched vocabulary too generic to reveal recipe phases |
These involve two independent modalities (decoded text + illustrations) converging on the same conclusion. No random decoder can match illustrations.
- Petersen botanical IDs: 3 plants independently identified by botanist and by H12 (tamala, kamala, tambula)
- Rajas on women's folios: decoded "ra" (menstrual/female principle) concentrates in sections illustrating women
- Recipe-illustration coherence: pharmaceutical sections with vessel illustrations decode as processing steps; herbal sections with plant illustrations decode as plant parts
We welcome adversarial testing. If you can think of a test that would falsify or strengthen this hypothesis, we want to hear it.
Ways to contribute:
- Open an issue describing a test you'd like to see. We will implement and run it, and publish the results regardless of outcome (as we have with our failed tests).
- Submit a PR with a new validation script. Follow the pattern of existing scripts: decode the corpus with H12, run the same test on 200 random decoders, report the Z-score.
- Attempt hostile replication: clone the repo, run the scripts, report what you find.
- Challenge a specific claim: if any number in this README doesn't match what you get when running the script, file a bug.
Tests we would particularly value from the community:
- Independent Sinhala/Elu linguistic assessment of the decoded text
- Additional cross-language discrimination with larger vocabularies
- Statistical tests for hidden structure that we haven't thought of
- Comparison against other proposed Voynich decipherments using the same random-decoder methodology
The full audit trail of all tests (passed and failed) is in `results/structural_evidence_summary.txt` and `VALIDATION_LOG.md`.
```bibtex
@article{Basra2026,
  author = {Basra, Kameldip Singh},
  title  = {The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala},
  year   = {2026},
  doi    = {10.5281/zenodo.18644772},
  url    = {https://doi.org/10.5281/zenodo.18644772}
}
```

This work is licensed under CC BY-NC 4.0 -- free for personal and academic research use. For commercial licensing, contact kameldipbasra@gmail.com. See LICENSE file.