This guide covers running Barnacle batch processing on Ubuntu VMs or similar environments using GNU Parallel as an alternative to SLURM.
For environments without HPC job schedulers (SLURM, PBS, etc.), GNU Parallel provides a simple way to process IIIF collections in parallel. The scripts/batch_process.sh wrapper handles:
- Parallel execution with configurable worker count
- Job logging for progress tracking
- Resume support for interrupted runs
- Optional tmux session management for long-running jobs
- Graceful Ctrl+C shutdown
Ubuntu/Debian:
sudo apt install parallel
macOS:
brew install parallel
RHEL/CentOS/Fedora:
sudo dnf install parallel
After the first run, you may see a citation notice. Run this once and follow the prompt to silence it:
parallel --citation
Follow the main README installation instructions:
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install
pdm run kraken get 10.5281/zenodo.10592716
For running jobs that may take hours or days:
sudo apt install tmux
The simplest way to batch process manifests is with the barnacle run command:
# Process all manifests in a list file
pdm run barnacle run manifests.txt output/
# Limit pages per manifest (useful for testing)
pdm run barnacle run manifests.txt output/ --max-pages 5
Features:
- Automatically names output files using SHA1 hash of manifest URL
- Resume-safe: skips manifests that already have output files
- Simple one-line command for most batch processing needs
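Since outputs are keyed by the SHA1 hash of the manifest URL, you can predict a manifest's output filename from the shell. This is a sketch that assumes the name is the hex SHA-1 digest of the full URL string (no trailing newline); the example URL is a placeholder:

```shell
# Predict the output filename for a manifest URL (assumed scheme:
# hex SHA-1 of the URL string, hashed without a trailing newline).
url="https://example.org/iiif/manifest.json"
hash=$(printf '%s' "$url" | sha1sum | awk '{print $1}')
echo "output/${hash}.jsonl"
```

This is handy for spot-checking whether a specific manifest has already been processed.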
When to use barnacle run vs batch_process.sh:
| Use barnacle run when... | Use batch_process.sh when... |
|---|---|
| You want the simplest option | You need parallel processing |
| Processing sequentially is fine | You have many manifests (100+) |
| You don't need detailed job logs | You need resume from job logs |
| Quick testing or small batches | Production runs on VMs |
For parallel processing with more control, continue with batch_process.sh below.
The canonical manifest lists are stored in data/manifests/:
data/manifests/
├── all.txt # Complete list (2,853 URLs)
├── tranche-01.txt # URLs 1-500 (HPC)
├── tranche-02.txt # URLs 501-1000 (HPC)
├── tranche-03.txt # URLs 1001-1500 (HPC)
├── tranche-04.txt # URLs 1501-2000 (HPC)
├── tranche-05.txt # URLs 2001-2500 (HPC)
└── tranche-06.txt # URLs 2501-2853 (VM)
To regenerate the full list from source CSV:
python scripts/prepare_manifests.py data/lapidus_lar.csv -o data/manifests/all.txt
This validates each URL and writes the valid manifest URLs to the output file (one per line).
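The validation step behaves roughly like filtering for well-formed http(s) URLs. This is a simplified stand-in, not the script's actual logic (which may perform stricter checks); the file names here are illustrative:

```shell
# Simplified stand-in for the validation step: keep only lines that
# look like http(s) URLs; blanks, comments, and junk are dropped.
printf 'https://a.org/m.json\n# note\n\nnot-a-url\nhttp://b.org/m.json\n' > raw.txt
grep -E '^https?://' raw.txt > valid.txt
cat valid.txt
```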
# Process a single tranche
./scripts/batch_process.sh \
--manifest-list data/manifests/tranche-01.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716
# Or process the full list
./scripts/batch_process.sh \
--manifest-list data/manifests/all.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716
This will:
- Process manifests in parallel (default: half of available CPUs)
- Show progress as jobs complete
- Write a job log (e.g., batch_20260122_143052.log)
- Create JSONL output files in the output directory (named by SHA1 hash of the manifest URL)
# Clone and install
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install
# Download the OCR model
pdm run kraken get 10.5281/zenodo.10592716
# Create output directory
mkdir -p output
# From CSV file
python scripts/prepare_manifests.py data/lapidus_lar.csv -o manifests.txt
# Check the generated list
head manifests.txt
wc -l manifests.txt
Before processing the full collection, test with a few manifests:
# Create test subset
head -5 manifests.txt > test_manifests.txt
# Run test batch
./scripts/batch_process.sh \
--manifest-list test_manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--jobs 2
# Verify outputs
ls -la output/*.jsonl
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--jobs 8
For jobs that may run for hours or days, use one of these approaches to prevent disconnection issues.
# Start in tmux session
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--tmux
# Detach from session: Ctrl+B, then D
# Reattach later:
tmux attach -t barnacle
# Start new tmux session
tmux new -s barnacle
# Run the batch process
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716
# Detach: Ctrl+B, then D
# Reattach: tmux attach -t barnacle
# Start screen session
screen -S barnacle
# Run the batch process
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716
# Detach: Ctrl+A, then D
# Reattach: screen -r barnacle
# Run in background with output logging
nohup ./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
> batch_output.log 2>&1 &
# Check progress
tail -f batch_output.log
GNU Parallel shows a progress display by default:
Computers / CPU cores / Max jobs to run
1:local / 8 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
1:4/23/65%/0:05:32
The job log tracks every job's status:
# View job log
cat batch_20260122_143052.log
# Count completed jobs
awk 'NR > 1 && $7 == 0' batch_20260122_143052.log | wc -l
# Find failed jobs
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log
Job log columns:
- Sequence number
- Host
- Start time (epoch)
- Run time (seconds)
- Transfer time
- Bytes transferred
- Exit code (0 = success)
- Signal
- Command
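Because the log is tab-separated with a header row and the exit code in column 7, a pass/fail summary is one awk pass away. A sketch using a tiny synthetic two-row log (the log path and command text are placeholders):

```shell
# Build a tiny synthetic job log, then tally successes vs failures.
# Column 7 (Exitval) is 0 on success; NR > 1 skips the header row.
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > demo.log
printf '1\t:\t0\t1.0\t0\t0\t0\t0\tbarnacle ...\n' >> demo.log
printf '2\t:\t0\t1.2\t0\t0\t1\t0\tbarnacle ...\n' >> demo.log
awk -F'\t' 'NR > 1 { if ($7 == 0) ok++; else bad++ }
            END { printf "ok=%d failed=%d\n", ok + 0, bad + 0 }' demo.log
# -> ok=1 failed=1
```

Point the same awk one-liner at a real batch_*.log to summarize a production run.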
# Count completed JSONL files
ls output/*.jsonl | wc -l
# Check total lines (pages processed)
wc -l output/*.jsonl
If processing is interrupted (Ctrl+C, system reboot, etc.), resume with:
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--resume \
--joblog batch_20260122_143052.log
The --resume flag tells GNU Parallel to:
- Read the existing job log
- Skip successfully completed jobs
- Retry failed jobs
- Continue with unprocessed jobs
Note: Barnacle's --resume flag (enabled by default) also handles page-level resume within a manifest, so interrupted manifests will resume from the last completed page.
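Since each output line corresponds to one page, page-level resume can be pictured as counting the lines already written. This is an illustrative sketch of the assumed mechanics, not Barnacle's actual implementation; the file name is a placeholder:

```shell
# If an output file already has N lines, treat pages 1..N as done
# and resume at page N+1.
printf '{"page":1}\n{"page":2}\n{"page":3}\n' > demo.jsonl
done_pages=$(wc -l < demo.jsonl)
echo "resume from page $((done_pages + 1))"
# -> resume from page 4
```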
GNU Parallel is not installed:
sudo apt install parallel
Barnacle is not in PATH. Either:
# Activate virtual environment
source .venv/bin/activate
./scripts/batch_process.sh ...
# Or use pdm run
pdm run ./scripts/batch_process.sh ...
Reduce parallelism to lower memory pressure:
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--jobs 2
IIIF servers may rate-limit requests. Reduce parallelism:
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--jobs 4
# Find failed jobs in log
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log
# Get manifest URLs that failed
awk -F'\t' 'NR > 1 && $7 != 0 {print $9}' batch_20260122_143052.log
# Resume just the failed jobs
./scripts/batch_process.sh \
--manifest-list manifests.txt \
--output-dir ./output \
--model 10.5281/zenodo.10592716 \
--resume \
--joblog batch_20260122_143052.log
# Attach to existing session
tmux attach -t barnacle
# Or kill it and start fresh
tmux kill-session -t barnacle
./scripts/batch_process.sh --help
Usage:
./scripts/batch_process.sh --manifest-list <FILE> --output-dir <DIR> --model <MODEL> [OPTIONS]
Required Arguments:
--manifest-list <FILE> Path to manifest list file (one URL per line)
--output-dir <DIR> Directory for output JSONL files
--model <MODEL> Kraken model reference (DOI or path)
Options:
--jobs <N> Number of parallel workers (default: nproc/2)
--joblog <FILE> Path to job log file (default: batch_YYYYMMDD_HHMMSS.log)
--resume Resume from previous joblog (use with --joblog)
--tmux Start processing in a new tmux session
-h, --help Show this help message
| Feature | GNU Parallel | SLURM |
|---|---|---|
| Environment | Any Linux/macOS | HPC cluster with SLURM |
| Setup | apt install parallel | Cluster configuration |
| Parallelism | Single machine | Multi-node cluster |
| Job Management | Job log file | SLURM job arrays |
| Resume | --resume-failed | Resubmit failed tasks |
| Resource Limits | Manual (--jobs N) | SLURM resource allocation |
| Monitoring | Progress bar, job log | squeue, sacct |
| Best For | Small/medium batches, VMs | Large collections, HPC |
Use GNU Parallel when:
- Processing on a single VM or workstation
- Small to medium collections (hundreds of manifests)
- Environments without SLURM
- Quick testing before HPC deployment
Use SLURM when:
- Large collections (thousands of manifests)
- Access to HPC cluster resources
- Need for multi-node parallelism
- Production workflows with monitoring
- SLURM Deployment Guide - HPC cluster deployment
- Docker Deployment Guide - Container builds
- Deployment Plan - Full architecture design