Barnacle is a Python pipeline for generating high-quality, machine-readable text from scanned resources exposed via IIIF, with a focus on historically complex typography (e.g., 18th-century print features such as the long s) and standards-based interoperability.
The initial target is the Princeton University Library Digital Collections (Figgy/DPUL) Lapidus collection:
https://dpul.princeton.edu/lapidus
Barnacle is being developed at Princeton University Library and is intended to be useful both internally (PUL workflows) and externally (e.g., collaborating researchers).
Barnacle is currently a production-ready MVP, deployed on the Tufts HPC cluster.
The pipeline processes IIIF Presentation 2.1 manifests and collections, extracts page images via IIIF Image API, runs OCR with Kraken, and writes corpus-friendly JSONL outputs with full provenance tracking.
Deployment modes:
- Local testing: CLI tool for single manifests
- HPC production: Docker/Singularity containers with SLURM job arrays for parallel collection processing
Current capabilities:
- ✅ IIIF Presentation 2.1 support (manifests and collections)
- ✅ Kraken OCR with configurable models
- ✅ Resume-safe processing (skip already-processed pages)
- ✅ Per-manifest JSONL output with provenance
- ✅ Parallel processing via SLURM job arrays
- ✅ Containerized deployment (Docker → Singularity)
# Clone repository
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
# Install dependencies with PDM
pdm install
# Download Kraken model
pdm run kraken get 10.5281/zenodo.10592716
# Validate a manifest
pdm run barnacle validate <MANIFEST_URL>
# Run OCR on a single manifest (local testing)
pdm run barnacle ocr <MANIFEST_URL> \
--model 10.5281/zenodo.10592716 \
--out output.jsonl \
--max-pages 10
# Run OCR on multiple manifests (batch processing)
pdm run barnacle run manifests.txt output/ --max-pages 5
# Sample IIIF image URL from manifest
pdm run barnacle sample-image-url <MANIFEST_URL>
For production processing on HPC clusters with SLURM:
# Build Docker image
docker build -t yourusername/barnacle:latest .
# Push to DockerHub
docker push yourusername/barnacle:latest
# On HPC: Convert to Singularity
singularity pull barnacle.sif docker://yourusername/barnacle:latest
# Process collection in parallel
./slurm/run_collection.sh <COLLECTION_URL> lapidus_batch
See docs/docker.md and docs/slurm.md for detailed deployment instructions.
PUL's repository provides OCR for many scanned volumes, but the default OCR quality is not sufficient for certain kinds of historical print (notably typographic features and ligatures). Barnacle exists to:
- produce improved text output suitable for corpus linguistics and downstream NLP/ML work
- preserve provenance (which image produced which text, with which model/config)
- provide a clean path to attach recognized text back to IIIF resources as Web Annotations (future milestone)
Barnacle is organized as modular Python packages with clear separation of concerns:
- barnacle.iiif.v2 - IIIF Presentation 2.1 support
  - Type-safe Pydantic models (Manifest, Collection, Canvas, ImageService)
  - Loaders for files and URLs (load_manifest, load_collection)
  - Validation for pipeline requirements (validate_manifest, validate_collection)
  - Traversal helpers (iter_manifests for collections)
- barnacle.ocr - OCR engine abstraction
  - KrakenBackend with configurable models
  - Automatic model installation from DOIs
  - Support for custom model paths
- barnacle.pipeline - Processing logic
  - coordinator.py: Collection parsing, manifest list generation
  - worker.py: Single manifest processing (SLURM-compatible)
  - output.py: SHA1-based output paths, resume tracking
- barnacle.cli - Command-line interface
  - validate: Validate manifests/collections
  - ocr: Run OCR locally or on single manifests
  - run: Process multiple manifests from a list file
  - sample-image-url: Extract IIIF image URLs
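To make the traversal step concrete, here is a minimal sketch that walks a Presentation 2.1 manifest and pairs each canvas with its image service. It works on plain dictionaries rather than Barnacle's Pydantic models, whose fields are not shown here; the function name iter_image_services and the example manifest are hypothetical.

```python
from typing import Iterator


def iter_image_services(manifest: dict) -> Iterator[tuple[str, str]]:
    """Yield (canvas_id, image_service_id) pairs from a Presentation 2.1 manifest."""
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                # "service" may be a single object or a list of objects.
                if isinstance(service, list):
                    service = service[0] if service else {}
                service_id = service.get("@id")
                if service_id:
                    yield canvas["@id"], service_id.rstrip("/")


# Hypothetical, heavily trimmed manifest for illustration only.
example_manifest = {
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {
                    "@id": "https://example.org/canvas/1",
                    "images": [
                        {
                            "resource": {
                                "service": {"@id": "https://iiif.example.org/image1/"}
                            }
                        }
                    ],
                }
            ]
        }
    ],
}

pairs = list(iter_image_services(example_manifest))
```

Canvases without an image service are skipped silently in this sketch; a validation step would more likely report them.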
Collection URL or CSV file
↓
Coordinator: Parse collection/CSV → manifest list (TSV)
↓
SLURM Job Array: N parallel workers
├─ Worker 1: process_manifest() → SHA1_1.jsonl
├─ Worker 2: process_manifest() → SHA1_2.jsonl
└─ Worker N: process_manifest() → SHA1_N.jsonl
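As a rough illustration of how one array task could pick its manifest from the coordinator's list, the sketch below reads SLURM_ARRAY_TASK_ID (the index SLURM exports to each array task) and selects the matching row. The helper names, the 0-based indexing, and the assumption that the manifest URL is the first TSV column are illustrative and not taken from Barnacle's worker.

```python
import os


def task_index(default: int = 0) -> int:
    """Read the array index that SLURM exports to each array task."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def manifest_for_task(tsv_lines: list[str], index: int) -> str:
    """Return the manifest URL on the row matching a 0-based task index.

    Assumes the URL is the first tab-separated column (hypothetical layout).
    """
    rows = [line.rstrip("\n") for line in tsv_lines if line.strip()]
    return rows[index].split("\t")[0]


# Hypothetical manifest list, as a coordinator might write it.
manifest_list = [
    "https://example.org/manifest/a\tark:/00000/a\n",
    "https://example.org/manifest/b\tark:/00000/b\n",
]

second = manifest_for_task(manifest_list, 1)
```

Because each task reads only its own row, a failed task can be resubmitted by index without touching the others.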
Input sources:
- IIIF Collection URL: Automatically traverses and extracts manifest URLs
- CSV file (--csv flag): Reads pre-extracted manifest URLs from the manifest_url column
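A minimal sketch of the CSV input path, using only the standard library: it collects non-empty values from the manifest_url column. The function name and the extra ark column in the sample data are assumptions for illustration; only the manifest_url column name comes from the description above.

```python
import csv
import io
from typing import TextIO


def manifest_urls_from_csv(handle: TextIO) -> list[str]:
    """Collect non-empty values from the manifest_url column."""
    urls = []
    for row in csv.DictReader(handle):
        url = (row.get("manifest_url") or "").strip()
        if url:
            urls.append(url)
    return urls


# Hypothetical CSV content; the second data row has no manifest URL.
sample = io.StringIO(
    "manifest_url,ark\n"
    "https://example.org/manifest/a,ark:/00000/a\n"
    ",ark:/00000/missing\n"
    "https://example.org/manifest/b,ark:/00000/b\n"
)

urls = manifest_urls_from_csv(sample)
```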
Key design decisions:
- Per-manifest output files (SHA1-named): Enables parallel processing without file contention
- Resume safety: Page-level tracking via page_key allows interrupted jobs to resume
- SLURM-native: Uses job arrays for cluster-native parallelism and fault tolerance
- Containerized: Docker images converted to Singularity for HPC deployment
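The first two decisions can be sketched together: derive a per-manifest file name from the SHA1 of the manifest URL, then read back the page_key values already written so that a restarted job skips them. This is an illustration under stated assumptions, not the code in output.py; the function names are hypothetical, and skipping a truncated final line is one possible way to tolerate an interrupted write.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def output_path(out_dir: Path, manifest_url: str) -> Path:
    """Name the output file after the SHA1 hex digest of the manifest URL."""
    digest = hashlib.sha1(manifest_url.encode("utf-8")).hexdigest()
    return out_dir / f"{digest}.jsonl"


def done_page_keys(path: Path) -> set[str]:
    """Return the page_key of every complete record already in the file."""
    keys: set[str] = set()
    if not path.exists():
        return keys
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                keys.add(json.loads(line)["page_key"])
            except (json.JSONDecodeError, KeyError):
                # Blank or partially written line: treat the page as not done.
                continue
    return keys


with tempfile.TemporaryDirectory() as tmp:
    path = output_path(Path(tmp), "https://example.org/manifest")
    # Simulate an earlier run that finished only the first page.
    path.write_text(json.dumps({"page_key": "page-1", "text": "..."}) + "\n", encoding="utf-8")
    done = done_page_keys(path)
    todo = [key for key in ["page-1", "page-2"] if key not in done]
```

Since each manifest maps to exactly one file, parallel workers never append to the same path.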
Barnacle writes JSONL files (one record per page) with comprehensive provenance:
{
"created_at": "2026-01-22T12:34:56.789Z",
"page_key": "https://...|canvas_id|model|jpg|!3000,3000|default|full|0",
"canvas_index": 0,
"engine": "kraken",
"model": {"ref": "10.5281/zenodo.10592716", "resolved": "/path/to/model.mlmodel"},
"manifest_url": "https://example.org/manifest",
"canvas_id": "https://example.org/canvas/1",
"image_url": "https://iiif.example.org/image1/full/!3000,3000/0/default.jpg",
"elapsed_ms": 1234,
"text": "Recognized text content...",
"source_metadata_id": "optional_csv_field",
"ark": "optional_ark_identifier"
}
Key fields:
- page_key: Stable identifier for resume/deduplication (manifest + canvas + model + IIIF params)
- text: Full OCR text for the page
- model: Both user-provided reference (DOI) and resolved path
- elapsed_ms: Processing time for performance tracking
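For downstream corpus work, records like the one above can be read back with the standard library. The sketch below deduplicates on page_key, orders pages by canvas_index, and joins the text; the function name and the abbreviated sample records are hypothetical, and joining pages with a blank line is just one reasonable choice.

```python
import json
from typing import Iterable


def pages_in_order(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines, keep one record per page_key, sort by canvas_index."""
    records = [json.loads(line) for line in lines if line.strip()]
    by_key = {record["page_key"]: record for record in records}
    return sorted(by_key.values(), key=lambda record: record["canvas_index"])


# Hypothetical, abbreviated records; real records carry the full field set.
sample_lines = [
    json.dumps({"page_key": "m|c2", "canvas_index": 1, "text": "second page"}),
    json.dumps({"page_key": "m|c1", "canvas_index": 0, "text": "first page"}),
    json.dumps({"page_key": "m|c2", "canvas_index": 1, "text": "second page"}),
]

full_text = "\n\n".join(record["text"] for record in pages_in_order(sample_lines))
```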
Output file naming:
- Single manifest: User-specified path (e.g., output.jsonl)
- Collection processing: SHA1-based per-manifest files (e.g., <sha1_of_manifest_url>.jsonl)
- Docker Deployment Guide: Building containers, testing locally, pushing to DockerHub
- SLURM Deployment Guide: HPC cluster deployment, job arrays, monitoring
- Batch Processing Guide: GNU Parallel for non-HPC environments (VMs, workstations)
- Deployment Plan: Full architecture design and production roadmap
- Tests: Comprehensive test suite with fixtures
Barnacle integrates with:
- IIIF Presentation API 2.1: Manifest/Collection parsing and validation https://iiif.io/api/presentation/2.1/
- IIIF Image API 2.1: Image URL construction and parameter handling https://iiif.io/api/image/2.1/
- Kraken: OCR/ATR engine with model management https://kraken.re/
- Pydantic: Type-safe data models and validation https://docs.pydantic.dev/
- PDM: Python package and dependency management https://pdm-project.org/
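As a small illustration of image URL construction, the IIIF Image API 2.1 request pattern is {service}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below builds such a URL; the function name is hypothetical, and the defaults simply mirror the image_url shown in the sample record rather than any documented Barnacle setting.

```python
def image_url(
    service_id: str,
    region: str = "full",
    size: str = "!3000,3000",
    rotation: str = "0",
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Build an Image API 2.1 request URL from a service base URI."""
    base = service_id.rstrip("/")
    return f"{base}/{region}/{size}/{rotation}/{quality}.{fmt}"


url = image_url("https://iiif.example.org/image1")
```

The !w,h size form asks the server for the largest image that fits inside the given box while preserving the aspect ratio.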
Planned:
- IIIF Presentation API 3.0 support
- Web Annotation output format
- Additional OCR engines (Tesseract, etc.)
- Python 3.12+
- PDM (Python package manager)
- libvips (for Kraken image processing)
# Install PDM
pip install pdm
# Clone repository
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
# Install dependencies
pdm install
# Install Kraken model
pdm run kraken get 10.5281/zenodo.10592716
# Run all tests
pdm run pytest
# Run with coverage
pdm run pytest --cov=barnacle --cov-report=html
# Run specific test file
pdm run pytest tests/test_iiif_models.py
Barnacle includes a justfile with common workflow commands. Install just and run just to see available recipes:
# Install just (macOS)
brew install just
# Install just (Ubuntu/Debian)
sudo apt install just
# List all available commands
just
# Common commands
just test # Run tests
just lint # Run linter
just check # Run lint + tests
just ocr-smoke # Quick OCR test (2 pages)
just run manifests.txt output/ # Batch process manifests
Run just --list to see all available recipes with descriptions.
barnacle/
├── src/barnacle/
│ ├── iiif/v2/ # IIIF Presentation 2.1 models
│ ├── pipeline/ # Processing logic
│ ├── ocr.py # OCR engine abstraction
│ └── cli.py # Command-line interface
├── tests/ # Test suite with fixtures
├── slurm/ # SLURM job scripts
├── scripts/ # Utility scripts
├── docs/ # Documentation
├── Dockerfile # Container definition
└── pyproject.toml # Project configuration
Contributions are welcome! Please:
- Open an issue to discuss new features or report bugs
- Include context: For bugs, provide manifest URLs or test cases; for features, explain the use case
- Write tests: All new features should include tests in tests/
- Follow code style: Use type hints and docstrings
- Update documentation: Update relevant docs in docs/ if needed
- Bugs: Include manifest URL, command used, error output, and expected behavior
- Feature requests: Describe the use case and provide examples (manifest snippets, expected output format, etc.)
- HPC deployment issues: Include cluster details (SLURM version, Singularity version, storage configuration)
Future enhancements under consideration:
- IIIF Presentation 3.0 support
- Web Annotation output format for attaching OCR to IIIF Canvases
- Additional OCR engines: Tesseract, custom engines
- Post-processing: Text normalization, dehyphenation, ligature expansion
- Quality metrics: Confidence scores, validation against ground truth
- API server: RESTful API for on-demand OCR
- Cloud deployment: Kubernetes/container orchestration beyond SLURM
See GitHub Issues for active discussions.
- Tufts HPC cluster team for deployment support and infrastructure
- Kraken developers for the OCR engine and model ecosystem
- IIIF community for standards and best practices
- McCATMuS project for historical print recognition models
Copyright 2024–2025 Princeton University Library. Additional copyright may be held by others, as reflected in the commit log.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.