Barnacle

Barnacle is a Python pipeline for generating high-quality, machine-readable text from scanned resources exposed via IIIF, with a focus on historically complex typography (e.g., 18th-century print features such as the long s) and standards-based interoperability.

The initial target is the Princeton University Library Digital Collections (Figgy/DPUL) Lapidus collection: https://dpul.princeton.edu/lapidus

Barnacle is being developed at Princeton University Library and is intended to be useful both internally (PUL workflows) and externally (e.g., collaborating researchers).

Status

A production-ready MVP is deployed on the Tufts HPC cluster.

The pipeline processes IIIF Presentation 2.1 manifests and collections, extracts page images via the IIIF Image API, runs OCR with Kraken, and writes corpus-friendly JSONL outputs with full provenance tracking.

Deployment modes:

Local testing: CLI tool for single manifests
HPC production: Docker/Singularity containers with SLURM job arrays for parallel collection processing

Current capabilities:

✅ IIIF Presentation 2.1 support (manifests and collections)
✅ Kraken OCR with configurable models
✅ Resume-safe processing (skip already-processed pages)
✅ Per-manifest JSONL output with provenance
✅ Parallel processing via SLURM job arrays
✅ Containerized deployment (Docker → Singularity)

Quick Start

Installation

# Clone repository
git clone https://github.com/pulibrary/barnacle.git
cd barnacle

# Install dependencies with PDM
pdm install

# Download Kraken model
pdm run kraken get 10.5281/zenodo.10592716

Basic Usage

# Validate a manifest
pdm run barnacle validate <MANIFEST_URL>

# Run OCR on a single manifest (local testing)
pdm run barnacle ocr <MANIFEST_URL> \
    --model 10.5281/zenodo.10592716 \
    --out output.jsonl \
    --max-pages 10

# Run OCR on multiple manifests (batch processing)
pdm run barnacle run manifests.txt output/ --max-pages 5

# Sample IIIF image URL from manifest
pdm run barnacle sample-image-url <MANIFEST_URL>

HPC Deployment

For production processing on HPC clusters with SLURM:

# Build Docker image (tag it with your DockerHub namespace so the push below finds it)
docker build -t yourusername/barnacle:latest .

# Push to DockerHub
docker push yourusername/barnacle:latest

# On HPC: Convert to Singularity
singularity pull barnacle.sif docker://yourusername/barnacle:latest

# Process collection in parallel
./slurm/run_collection.sh <COLLECTION_URL> lapidus_batch

See docs/docker.md and docs/slurm.md for detailed deployment instructions.

Motivation

PUL's repository provides OCR for many scanned volumes, but the default OCR is not accurate enough for certain kinds of historical print, notably typographic features such as the long s and ligatures. Barnacle exists to:

produce improved text output suitable for corpus linguistics and downstream NLP/ML work
preserve provenance (which image produced which text, with which model/config)
provide a clean path to attach recognized text back to IIIF resources as Web Annotations (future milestone)

Architecture

Barnacle is organized as modular Python packages with clear separation of concerns:

Core Modules

barnacle.iiif.v2 - IIIF Presentation 2.1 support
- Type-safe Pydantic models (Manifest, Collection, Canvas, ImageService)
- Loaders for files and URLs (load_manifest, load_collection)
- Validation for pipeline requirements (validate_manifest, validate_collection)
- Traversal helpers (iter_manifests for collections)
barnacle.ocr - OCR engine abstraction
- KrakenBackend with configurable models
- Automatic model installation from DOIs
- Support for custom model paths
barnacle.pipeline - Processing logic
- coordinator.py: Collection parsing, manifest list generation
- worker.py: Single manifest processing (SLURM-compatible)
- output.py: SHA1-based output paths, resume tracking
barnacle.cli - Command-line interface
- validate: Validate manifests/collections
- ocr: Run OCR locally or on single manifests
- run: Process multiple manifests from a list file
- sample-image-url: Extract IIIF image URLs
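
As an illustration of the traversal and image-URL steps, here is a minimal sketch that works on a plain dict instead of Barnacle's Pydantic models. The function names iter_image_services and image_url are hypothetical and are not part of barnacle.iiif.v2; the manifest layout follows IIIF Presentation 2.1 and the URL pattern follows IIIF Image API 2.1. The default size of !3000,3000 is taken from the sample record in Output Format.

```python
from typing import Iterator


def iter_image_services(manifest: dict) -> Iterator[tuple[str, str]]:
    """Yield (canvas_id, image_service_id) pairs from a Presentation 2.1 manifest dict."""
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service")
                # "service" may be a single object or a list of objects.
                if isinstance(service, list):
                    service = service[0] if service else None
                if isinstance(service, dict) and "@id" in service:
                    yield canvas.get("@id", ""), service["@id"]
                    break  # the first usable image on a canvas is enough here


def image_url(service_id: str, size: str = "!3000,3000", fmt: str = "jpg") -> str:
    """Build an Image API 2.1 URL: {id}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{service_id.rstrip('/')}/full/{size}/0/default.{fmt}"


# Hypothetical one-canvas manifest, reduced to the fields the sketch reads.
manifest = {
    "@id": "https://example.org/manifest",
    "sequences": [{
        "canvases": [{
            "@id": "https://example.org/canvas/1",
            "images": [{
                "resource": {"service": {"@id": "https://iiif.example.org/image1"}}
            }],
        }]
    }],
}

for canvas_id, service_id in iter_image_services(manifest):
    print(canvas_id, image_url(service_id))
```

The real loaders also validate the manifest before use; this sketch skips canvases that carry no image service.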

Deployment Architecture

Collection URL or CSV file
    ↓
Coordinator: Parse collection/CSV → manifest list (TSV)
    ↓
SLURM Job Array: N parallel workers
    ├─ Worker 1: process_manifest() → SHA1_1.jsonl
    ├─ Worker 2: process_manifest() → SHA1_2.jsonl
    └─ Worker N: process_manifest() → SHA1_N.jsonl
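
The fan-out above depends on each array task picking its own manifest from the shared list. A minimal sketch of that selection follows. It assumes a 0-based array (for example --array=0-(N-1)) and a list file whose first tab-separated column is the manifest URL; both are assumptions, and the helper names are hypothetical, not Barnacle's worker API. SLURM_ARRAY_TASK_ID is the environment variable SLURM sets for each array task.

```python
import os
import tempfile
from pathlib import Path


def current_task_id(default: int = 0) -> int:
    """Read the array index SLURM sets for each task; fall back for local runs."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def manifest_for_task(list_path: Path, task_id: int) -> str:
    """Return the manifest URL on the line selected by a 0-based task index."""
    lines = [ln for ln in list_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    if not 0 <= task_id < len(lines):
        raise IndexError(f"task {task_id} out of range for {len(lines)} manifests")
    # Assumption: the manifest URL is the first tab-separated column.
    return lines[task_id].split("\t")[0].strip()


# Demo with a throwaway two-line manifest list.
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "manifests.tsv"
    list_path.write_text(
        "https://example.org/manifest/1\trec-001\n"
        "https://example.org/manifest/2\trec-002\n",
        encoding="utf-8",
    )
    first = manifest_for_task(list_path, 0)
    second = manifest_for_task(list_path, 1)
    print(first, second)
```

Inside a job script, a worker would call manifest_for_task(list_path, current_task_id()) so that every task handles a different line.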

Input sources:

IIIF Collection URL: Automatically traverses and extracts manifest URLs
CSV file (--csv flag): Reads pre-extracted manifest URLs from manifest_url column
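
For the CSV input, the only column this README documents is manifest_url. The sketch below reads that column with the standard library and skips rows where it is empty. The function name iter_manifest_rows and the extra columns in the sample data (source_metadata_id, ark) are assumptions chosen to match the optional fields shown in Output Format; Barnacle's own reader may differ.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_manifest_rows(handle: TextIO) -> Iterator[dict]:
    """Yield CSV rows that carry a non-empty manifest_url value."""
    reader = csv.DictReader(handle)
    if "manifest_url" not in (reader.fieldnames or []):
        raise ValueError("CSV needs a manifest_url column")
    for row in reader:
        url = (row.get("manifest_url") or "").strip()
        if url:
            # Keep the other columns so they can travel with the record.
            yield {**row, "manifest_url": url}


# Hypothetical sample; the second data row has no manifest URL and is skipped.
sample = io.StringIO(
    "manifest_url,source_metadata_id,ark\n"
    "https://example.org/manifest/1,rec-001,ark:/12345/abc\n"
    ",rec-002,\n"
    "https://example.org/manifest/2,rec-003,\n"
)
urls = [row["manifest_url"] for row in iter_manifest_rows(sample)]
print(urls)
```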

Key design decisions:

Per-manifest output files (SHA1-named): Enables parallel processing without file contention
Resume safety: Page-level tracking via page_key allows interrupted jobs to resume
SLURM-native: Uses job arrays for cluster-native parallelism and fault tolerance
Containerized: Docker images converted to Singularity for HPC deployment
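
The SHA1-named files are what let workers run without coordinating: each manifest URL maps to one deterministic file name, so no two workers write to the same path. A minimal sketch, assuming the hash is taken over the UTF-8 bytes of the manifest URL exactly as given (the normalization used in output.py is not documented here, and output_path_for is a hypothetical name):

```python
import hashlib
from pathlib import Path


def output_path_for(manifest_url: str, out_dir: Path) -> Path:
    """Map a manifest URL to a stable per-manifest JSONL path."""
    digest = hashlib.sha1(manifest_url.encode("utf-8")).hexdigest()
    return out_dir / f"{digest}.jsonl"


out_dir = Path("output")
path_a = output_path_for("https://example.org/manifest/1", out_dir)
path_b = output_path_for("https://example.org/manifest/2", out_dir)

# Same URL, same path on every run; different URLs, different paths.
print(path_a.name)
print(path_a == output_path_for("https://example.org/manifest/1", out_dir))
print(path_a != path_b)
```

Because the name depends only on the URL, a restarted worker finds the file left by its earlier attempt and can continue from it.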

Output Format

Barnacle writes JSONL files (one record per page) with comprehensive provenance:

{
  "created_at": "2026-01-22T12:34:56.789Z",
  "page_key": "https://...|canvas_id|model|jpg|!3000,3000|default|full|0",
  "canvas_index": 0,
  "engine": "kraken",
  "model": {"ref": "10.5281/zenodo.10592716", "resolved": "/path/to/model.mlmodel"},
  "manifest_url": "https://example.org/manifest",
  "canvas_id": "https://example.org/canvas/1",
  "image_url": "https://iiif.example.org/image1/full/!3000,3000/0/default.jpg",
  "elapsed_ms": 1234,
  "text": "Recognized text content...",
  "source_metadata_id": "optional_csv_field",
  "ark": "optional_ark_identifier"
}

Key fields:

page_key: Stable identifier for resume/deduplication (manifest + canvas + model + IIIF params)
text: Full OCR text for the page
model: Both user-provided reference (DOI) and resolved path
elapsed_ms: Processing time for performance tracking
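
To show how page_key can drive resume, the sketch below builds a key and filters out pages already present in an output file. The field order in make_page_key is inferred from the sample record above (manifest, canvas, model, format, size, quality, region, rotation) and is an assumption; both function names are hypothetical, not Barnacle's API.

```python
import json
import tempfile
from pathlib import Path


def make_page_key(manifest_url: str, canvas_id: str, model_ref: str,
                  fmt: str = "jpg", size: str = "!3000,3000",
                  quality: str = "default", region: str = "full",
                  rotation: str = "0") -> str:
    """Join the inputs that determine a page's OCR result into one key."""
    return "|".join([manifest_url, canvas_id, model_ref, fmt, size, quality, region, rotation])


def load_done_keys(path: Path) -> set[str]:
    """Collect page_key values from an existing JSONL file, if any."""
    done: set[str] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["page_key"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # An interrupted job can leave a truncated last line; ignore it.
                continue
    return done


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.jsonl"
    manifest_url = "https://example.org/manifest"
    model_ref = "10.5281/zenodo.10592716"
    canvases = [f"https://example.org/canvas/{i}" for i in (1, 2, 3)]
    first_key = make_page_key(manifest_url, canvases[0], model_ref)
    # One finished page followed by a record cut off mid-write.
    out.write_text(
        json.dumps({"page_key": first_key, "text": "..."}) + "\n" + '{"page_key": "trunc',
        encoding="utf-8",
    )
    done = load_done_keys(out)
    todo = [c for c in canvases if make_page_key(manifest_url, c, model_ref) not in done]
    print(todo)
```

Including the model and image parameters in the key means that rerunning with a different model or size does not skip pages processed under the old settings.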

Output file naming:

Single manifest: User-specified path (e.g., output.jsonl)
Collection processing: SHA1-based per-manifest files (e.g., <sha1_of_manifest_url>.jsonl)
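
For downstream corpus work, a per-manifest file can be read back into one text per volume. The sketch below uses only fields shown in the sample record (canvas_index and text). Sorting by canvas_index is a precaution on the assumption that a resumed run may append pages out of order; manifest_text is a hypothetical helper, not part of Barnacle.

```python
import json
import tempfile
from pathlib import Path


def manifest_text(path: Path) -> str:
    """Concatenate page texts from one JSONL file in canvas order."""
    records = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    records.sort(key=lambda record: record["canvas_index"])
    # A blank line between pages keeps page boundaries visible in the text.
    return "\n\n".join(record["text"] for record in records)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.jsonl"
    # Pages deliberately written out of order.
    path.write_text(
        json.dumps({"canvas_index": 1, "text": "second page"}) + "\n"
        + json.dumps({"canvas_index": 0, "text": "first page"}) + "\n",
        encoding="utf-8",
    )
    combined = manifest_text(path)
    print(combined)
```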

Documentation

Docker Deployment Guide: Building containers, testing locally, pushing to DockerHub
SLURM Deployment Guide: HPC cluster deployment, job arrays, monitoring
Batch Processing Guide: GNU Parallel for non-HPC environments (VMs, workstations)
Deployment Plan: Full architecture design and production roadmap
Tests: Comprehensive test suite with fixtures

Standards and Dependencies

Barnacle integrates with:

IIIF Presentation API 2.1: Manifest/Collection parsing and validation https://iiif.io/api/presentation/2.1/
IIIF Image API 2.1: Image URL construction and parameter handling https://iiif.io/api/image/2.1/
Kraken: OCR/ATR engine with model management https://kraken.re/
Pydantic: Type-safe data models and validation https://docs.pydantic.dev/
PDM: Python package and dependency management https://pdm-project.org/

Planned:

IIIF Presentation API 3.0 support
Web Annotation output format
Additional OCR engines (Tesseract, etc.)

Development Setup

Prerequisites

Python 3.12+
PDM (Python package manager)
libvips (for Kraken image processing)

Setup

# Install PDM
pip install pdm

# Clone repository
git clone https://github.com/pulibrary/barnacle.git
cd barnacle

# Install dependencies
pdm install

# Install Kraken model
pdm run kraken get 10.5281/zenodo.10592716

Running Tests

# Run all tests
pdm run pytest

# Run with coverage
pdm run pytest --cov=barnacle --cov-report=html

# Run specific test file
pdm run pytest tests/test_iiif_models.py

Using just

Barnacle includes a justfile with common workflow commands. Install just, then run it with no arguments to list the available recipes:

# Install just (macOS)
brew install just

# Install just (Ubuntu/Debian)
sudo apt install just

# List all available commands
just

# Common commands
just test           # Run tests
just lint           # Run linter
just check          # Run lint + tests
just ocr-smoke      # Quick OCR test (2 pages)
just run manifests.txt output/  # Batch process manifests

Run just --list to see all available recipes with descriptions.

Code Structure

barnacle/
├── src/barnacle/
│   ├── iiif/v2/          # IIIF Presentation 2.1 models
│   ├── pipeline/         # Processing logic
│   ├── ocr.py            # OCR engine abstraction
│   └── cli.py            # Command-line interface
├── tests/                # Test suite with fixtures
├── slurm/                # SLURM job scripts
├── scripts/              # Utility scripts
├── docs/                 # Documentation
├── Dockerfile            # Container definition
└── pyproject.toml        # Project configuration

Contributing

Contributions are welcome! Please:

Open an issue to discuss new features or report bugs
Include context: For bugs, provide manifest URLs or test cases; for features, explain the use case
Write tests: All new features should include tests in tests/
Follow code style: Use type hints and docstrings
Update documentation: Update relevant docs in docs/ if needed

Reporting Issues

Bugs: Include manifest URL, command used, error output, and expected behavior
Feature requests: Describe the use case and provide examples (manifest snippets, expected output format, etc.)
HPC deployment issues: Include cluster details (SLURM version, Singularity version, storage configuration)

Roadmap

Future enhancements under consideration:

IIIF Presentation 3.0 support
Web Annotation output format for attaching OCR to IIIF Canvases
Additional OCR engines: Tesseract, custom engines
Post-processing: Text normalization, dehyphenation, ligature expansion
Quality metrics: Confidence scores, validation against ground truth
API server: RESTful API for on-demand OCR
Cloud deployment: Kubernetes/container orchestration beyond SLURM

See GitHub Issues for active discussions.

Acknowledgments

Tufts HPC cluster team for deployment support and infrastructure
Kraken developers for the OCR engine and model ecosystem
IIIF community for standards and best practices
McCATMuS project for historical print recognition models

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
