
Batch Processing with GNU Parallel

This guide covers running Barnacle batch processing on Ubuntu VMs or similar environments using GNU Parallel as an alternative to SLURM.

Overview

For environments without HPC job schedulers (SLURM, PBS, etc.), GNU Parallel provides a simple way to process IIIF collections in parallel. The scripts/batch_process.sh wrapper handles:

  • Parallel execution with configurable worker count
  • Job logging for progress tracking
  • Resume support for interrupted runs
  • Optional tmux session management for long-running jobs
  • Graceful Ctrl+C shutdown
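
The core of such a wrapper is a fan-out/collect loop with a per-job exit-code log; GNU Parallel adds the worker cap (--jobs), timings, and resume support on top. A pure-shell sketch of the idea (the stub process_one and the demo file names are illustrative, not Barnacle's actual internals):

```shell
# Sketch of the wrapper's core idea: run each manifest as its own job and
# record per-job exit codes, joblog-style. GNU Parallel adds throttling
# (--jobs N), timings, and --resume on top of this pattern.
printf 'urlA\nurlB\nurlC\nurlD\n' > demo_manifests.txt

process_one() { echo "processed $1"; }   # stand-in for the real barnacle call

: > demo_joblog.txt
while IFS= read -r url; do
    # Launch each job in the background and append "<exit-code> <url>".
    { process_one "$url" >/dev/null; echo "$? $url" >> demo_joblog.txt; } &
done < demo_manifests.txt
wait    # block until every job has finished
```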

Prerequisites

Install GNU Parallel

Ubuntu/Debian:

sudo apt install parallel

macOS:

brew install parallel

RHEL/CentOS/Fedora:

sudo dnf install parallel

On the first run, GNU Parallel may print a citation notice. To silence it permanently, run the following once and type "will cite" when prompted:

parallel --citation

Install Barnacle

Follow the main README installation instructions:

git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install
pdm run kraken get 10.5281/zenodo.10592716

Optional: tmux for Long Jobs

For running jobs that may take hours or days:

sudo apt install tmux

Quick Start

Simple Batch Processing with barnacle run

The simplest way to batch process manifests is with the barnacle run command:

# Process all manifests in a list file
pdm run barnacle run manifests.txt output/

# Limit pages per manifest (useful for testing)
pdm run barnacle run manifests.txt output/ --max-pages 5

Features:

  • Automatically names output files using SHA1 hash of manifest URL
  • Resume-safe: skips manifests that already have output files
  • Simple one-line command for most batch processing needs
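
The SHA1 naming makes output filenames predictable. A sketch of computing the expected filename for any URL (this assumes the hash is taken over the exact URL string, hex-encoded, with no trailing newline; check your output directory to confirm the convention):

```shell
# Compute the expected output filename for a manifest URL, assuming
# Barnacle hashes the exact URL string with SHA1 (an assumption).
expected_output() {
    printf '%s' "$1" | sha1sum | awk '{print $1 ".jsonl"}'
}

expected_output "https://example.org/iiif/manifest.json"
```

Comparing the result against the files in the output directory tells you whether a given manifest has already been processed.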

When to use barnacle run vs batch_process.sh:

| Use barnacle run when...         | Use batch_process.sh when...    |
| -------------------------------- | ------------------------------- |
| You want the simplest option     | You need parallel processing    |
| Processing sequentially is fine  | You have many manifests (100+)  |
| You don't need detailed job logs | You need resume from job logs   |
| Quick testing or small batches   | Production runs on VMs          |

For parallel processing with more control, continue with batch_process.sh below.


1. Prepare Manifest List

The canonical manifest lists are stored in data/manifests/:

data/manifests/
├── all.txt           # Complete list (2,853 URLs)
├── tranche-01.txt    # URLs 1-500 (HPC)
├── tranche-02.txt    # URLs 501-1000 (HPC)
├── tranche-03.txt    # URLs 1001-1500 (HPC)
├── tranche-04.txt    # URLs 1501-2000 (HPC)
├── tranche-05.txt    # URLs 2001-2500 (HPC)
└── tranche-06.txt    # URLs 2501-2853 (VM)

To regenerate the full list from source CSV:

python scripts/prepare_manifests.py data/lapidus_lar.csv -o data/manifests/all.txt

This validates each URL and writes valid manifest URLs to the output file (one per line).
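
Before a long run it can be worth a quick sanity check of the generated list. The sketch below is purely syntactic, flagging any line that does not look like an http(s) URL; prepare_manifests.py presumably performs deeper validation:

```shell
# Print any line (with its line number) that does not look like an
# http(s) URL. Purely syntactic -- nothing is fetched.
check_list() {
    grep -vnE '^https?://[^[:space:]]+$' "$1" || true
}

printf 'https://example.org/ok\nnot-a-url\n' > demo_list.txt
check_list demo_list.txt   # prints: 2:not-a-url
```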

2. Run Batch Processing

# Process a single tranche
./scripts/batch_process.sh \
    --manifest-list data/manifests/tranche-01.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Or process the full list
./scripts/batch_process.sh \
    --manifest-list data/manifests/all.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

This will:

  • Process manifests in parallel (default: half of available CPUs)
  • Show progress as jobs complete
  • Write a job log (e.g., batch_20260122_143052.log)
  • Create JSONL output files in the output directory (named by SHA1 hash of manifest URL)

Full Workflow

Step 1: Prepare Your Environment

# Clone and install
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install

# Download the OCR model
pdm run kraken get 10.5281/zenodo.10592716

# Create output directory
mkdir -p output

Step 2: Generate Manifest List

# From CSV file
python scripts/prepare_manifests.py data/lapidus_lar.csv -o manifests.txt

# Check the generated list
head manifests.txt
wc -l manifests.txt

Step 3: Test with Small Batch

Before processing the full collection, test with a few manifests:

# Create test subset
head -5 manifests.txt > test_manifests.txt

# Run test batch
./scripts/batch_process.sh \
    --manifest-list test_manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 2

# Verify outputs
ls -la output/*.jsonl

Step 4: Run Full Batch

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 8

Running Long Jobs

For jobs that may run for hours or days, use one of these approaches to prevent disconnection issues.

Option 1: tmux (Recommended)

# Start in tmux session
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --tmux

# Detach from session: Ctrl+B, then D
# Reattach later:
tmux attach -t barnacle

Option 2: Manual tmux

# Start new tmux session
tmux new -s barnacle

# Run the batch process
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Detach: Ctrl+B, then D
# Reattach: tmux attach -t barnacle

Option 3: screen

# Start screen session
screen -S barnacle

# Run the batch process
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Detach: Ctrl+A, then D
# Reattach: screen -r barnacle

Option 4: nohup

# Run in background with output logging
nohup ./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    > batch_output.log 2>&1 &

# Check progress
tail -f batch_output.log

Monitoring Progress

During Processing

GNU Parallel displays progress by default:

Computers / CPU cores / Max jobs to run
1:local / 8 / 4

Computer:jobs running/jobs completed/%of started jobs/ETA:
1:4/23/65%/0:05:32

Check Job Log

The job log tracks every job's status:

# View job log
cat batch_20260122_143052.log

# Count successfully completed jobs (exit code 0 in column 7; skip the header)
awk 'NR > 1 && $7 == 0' batch_20260122_143052.log | wc -l

# Find failed jobs (skip the header row)
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log

Job log columns:

  1. Sequence number
  2. Host
  3. Start time (epoch)
  4. Run time (seconds)
  5. Transfer time
  6. Bytes transferred
  7. Exit code (0 = success)
  8. Signal
  9. Command
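
Those columns make the log easy to summarize with awk. A sketch against a fabricated example log (the field layout is GNU Parallel's; the sample rows and the command names in them are made up):

```shell
# Build a tiny joblog-shaped file (tab-separated, header + 3 jobs, one of
# which failed with exit code 1) purely for illustration.
{
    printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n'
    printf '1\t:\t1700000000\t12.3\t0\t0\t0\t0\tjob-for urlA\n'
    printf '2\t:\t1700000001\t9.8\t0\t0\t1\t0\tjob-for urlB\n'
    printf '3\t:\t1700000002\t15.1\t0\t0\t0\t0\tjob-for urlC\n'
} > demo.log

# Summarize totals, successes (exit code 0 in column 7), and failures.
awk -F'\t' 'NR > 1 { total++; if ($7 == 0) ok++; else bad++ }
            END { printf "total=%d ok=%d failed=%d\n", total, ok, bad }' demo.log
# prints: total=3 ok=2 failed=1
```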

Count Outputs

# Count completed JSONL files
ls output/*.jsonl | wc -l

# Check total lines (pages processed)
wc -l output/*.jsonl

Resuming Interrupted Batches

If processing is interrupted (Ctrl+C, system reboot, etc.), resume with:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --resume \
    --joblog batch_20260122_143052.log

The --resume flag tells GNU Parallel to:

  • Read the existing job log
  • Skip successfully completed jobs
  • Retry failed jobs
  • Continue with unprocessed jobs

Note: Barnacle's --resume flag (enabled by default) also handles page-level resume within a manifest, so interrupted manifests will resume from the last completed page.
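
Because outputs are named by the SHA1 hash of the manifest URL, you can also list which manifests still lack output before resuming. A sketch (assuming the hash covers the exact URL string; the demo list and directory are illustrative):

```shell
# Print manifest URLs that do not yet have a <sha1-of-url>.jsonl file in
# the output directory (naming assumption; verify against your outputs).
pending_manifests() {
    list=$1; outdir=$2
    while IFS= read -r url; do
        name=$(printf '%s' "$url" | sha1sum | awk '{print $1}')
        [ -f "$outdir/$name.jsonl" ] || printf '%s\n' "$url"
    done < "$list"
}

# Demo: mark one of two manifests as already processed.
mkdir -p demo_out
printf 'https://example.org/a\nhttps://example.org/b\n' > demo_list.txt
touch "demo_out/$(printf '%s' 'https://example.org/a' | sha1sum | awk '{print $1}').jsonl"
pending_manifests demo_list.txt demo_out   # prints: https://example.org/b
```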

Troubleshooting

"parallel: command not found"

GNU Parallel is not installed:

sudo apt install parallel

"barnacle: command not found"

Barnacle is not in PATH. Either:

# Activate virtual environment
source .venv/bin/activate
./scripts/batch_process.sh ...

# Or use pdm run
pdm run ./scripts/batch_process.sh ...

Jobs Failing with Memory Errors

Reduce parallelism to lower memory pressure:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 2

Jobs Failing with Network Errors

IIIF servers may rate-limit requests. Reduce parallelism:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 4

Checking Failed Jobs

# Find failed jobs in the log (skip the header row)
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log

# Get the commands (including manifest URLs) that failed
awk -F'\t' 'NR > 1 && $7 != 0 {print $9}' batch_20260122_143052.log

# Resume just the failed jobs
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --resume \
    --joblog batch_20260122_143052.log

tmux Session Already Exists

# Attach to existing session
tmux attach -t barnacle

# Or kill it and start fresh
tmux kill-session -t barnacle

Command Reference

./scripts/batch_process.sh --help

Usage:
  ./scripts/batch_process.sh --manifest-list <FILE> --output-dir <DIR> --model <MODEL> [OPTIONS]

Required Arguments:
  --manifest-list <FILE>   Path to manifest list file (one URL per line)
  --output-dir <DIR>       Directory for output JSONL files
  --model <MODEL>          Kraken model reference (DOI or path)

Options:
  --jobs <N>               Number of parallel workers (default: nproc/2)
  --joblog <FILE>          Path to job log file (default: batch_YYYYMMDD_HHMMSS.log)
  --resume                 Resume from previous joblog (use with --joblog)
  --tmux                   Start processing in a new tmux session
  -h, --help               Show this help message

Comparison with SLURM

| Feature         | GNU Parallel             | SLURM                     |
| --------------- | ------------------------ | ------------------------- |
| Environment     | Any Linux/macOS          | HPC cluster with SLURM    |
| Setup           | apt install parallel     | Cluster configuration     |
| Parallelism     | Single machine           | Multi-node cluster        |
| Job management  | Job log file             | SLURM job arrays          |
| Resume          | --resume-failed          | Resubmit failed tasks     |
| Resource limits | Manual (--jobs N)        | SLURM resource allocation |
| Monitoring      | Progress bar, job log    | squeue, sacct             |
| Best for        | Small/medium batches, VMs | Large collections, HPC   |

When to Use GNU Parallel

  • Processing on a single VM or workstation
  • Small to medium collections (hundreds of manifests)
  • Environments without SLURM
  • Quick testing before HPC deployment

When to Use SLURM

  • Large collections (thousands of manifests)
  • Access to HPC cluster resources
  • Need for multi-node parallelism
  • Production workflows with monitoring

See Also