
Batch Processing with GNU Parallel

This guide covers running Barnacle batch processing on Ubuntu VMs or similar environments using GNU Parallel as an alternative to SLURM.

Overview

For environments without HPC job schedulers (SLURM, PBS, etc.), GNU Parallel provides a simple way to process IIIF collections in parallel. The scripts/batch_process.sh wrapper handles:

  • Parallel execution with configurable worker count
  • Job logging for progress tracking
  • Resume support for interrupted runs
  • Optional tmux session management for long-running jobs
  • Graceful Ctrl+C shutdown
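
The core of such a wrapper is a fan-out/collect loop with a per-job exit-code log; GNU Parallel adds the worker cap (--jobs), timings, and resume support on top. A pure-shell sketch of the idea (the stub process_one and the demo file names are illustrative, not Barnacle's actual internals):

```shell
# Sketch of the wrapper's core idea: run each manifest as its own job and
# record per-job exit codes, joblog-style. GNU Parallel adds throttling
# (--jobs N), timings, and --resume on top of this pattern.
printf 'urlA\nurlB\nurlC\nurlD\n' > demo_manifests.txt

process_one() { echo "processed $1"; }   # stand-in for the real barnacle call

: > demo_joblog.txt
while IFS= read -r url; do
    # Launch each job in the background and append "<exit-code> <url>".
    { process_one "$url" >/dev/null; echo "$? $url" >> demo_joblog.txt; } &
done < demo_manifests.txt
wait    # block until every job has finished
```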

Prerequisites

Install GNU Parallel

Ubuntu/Debian:

sudo apt install parallel

macOS:

brew install parallel

RHEL/CentOS/Fedora:

sudo dnf install parallel

On the first run, GNU Parallel may print a citation notice. To silence it permanently, run the following once and type "will cite" when prompted:

parallel --citation

Install Barnacle

Follow the main README installation instructions:

git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install
pdm run kraken get 10.5281/zenodo.10592716

Optional: tmux for Long Jobs

For running jobs that may take hours or days:

sudo apt install tmux

Quick Start

Simple Batch Processing with barnacle run

The simplest way to batch process manifests is with the barnacle run command:

# Process all manifests in a list file
pdm run barnacle run manifests.txt output/

# Limit pages per manifest (useful for testing)
pdm run barnacle run manifests.txt output/ --max-pages 5

Features:

  • Automatically names output files using SHA1 hash of manifest URL
  • Resume-safe: skips manifests that already have output files
  • Simple one-line command for most batch processing needs
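
The SHA1 naming makes output filenames predictable. A sketch of computing the expected filename for any URL (this assumes the hash is taken over the exact URL string, hex-encoded, with no trailing newline; check your output directory to confirm the convention):

```shell
# Compute the expected output filename for a manifest URL, assuming
# Barnacle hashes the exact URL string with SHA1 (an assumption).
expected_output() {
    printf '%s' "$1" | sha1sum | awk '{print $1 ".jsonl"}'
}

expected_output "https://example.org/iiif/manifest.json"
```

Comparing the result against the files in the output directory tells you whether a given manifest has already been processed.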

When to use barnacle run vs batch_process.sh:

| Use barnacle run when...         | Use batch_process.sh when...    |
| -------------------------------- | ------------------------------- |
| You want the simplest option     | You need parallel processing    |
| Processing sequentially is fine  | You have many manifests (100+)  |
| You don't need detailed job logs | You need resume from job logs   |
| Quick testing or small batches   | Production runs on VMs          |

For parallel processing with more control, continue with batch_process.sh below.


1. Prepare Manifest List

The canonical manifest lists are stored in data/manifests/:

data/manifests/
├── all.txt           # Complete list (2,853 URLs)
├── tranche-01.txt    # URLs 1-500 (HPC)
├── tranche-02.txt    # URLs 501-1000 (HPC)
├── tranche-03.txt    # URLs 1001-1500 (HPC)
├── tranche-04.txt    # URLs 1501-2000 (HPC)
├── tranche-05.txt    # URLs 2001-2500 (HPC)
└── tranche-06.txt    # URLs 2501-2853 (VM)

To regenerate the full list from source CSV:

python scripts/prepare_manifests.py data/lapidus_lar.csv -o data/manifests/all.txt

This validates each URL and writes valid manifest URLs to the output file (one per line).
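
Before a long run it can be worth a quick sanity check of the generated list. The sketch below is purely syntactic, flagging any line that does not look like an http(s) URL; prepare_manifests.py presumably performs deeper validation:

```shell
# Print any line (with its line number) that does not look like an
# http(s) URL. Purely syntactic -- nothing is fetched.
check_list() {
    grep -vnE '^https?://[^[:space:]]+$' "$1" || true
}

printf 'https://example.org/ok\nnot-a-url\n' > demo_list.txt
check_list demo_list.txt   # prints: 2:not-a-url
```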

2. Run Batch Processing

# Process a single tranche
./scripts/batch_process.sh \
    --manifest-list data/manifests/tranche-01.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Or process the full list
./scripts/batch_process.sh \
    --manifest-list data/manifests/all.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

This will:

  • Process manifests in parallel (default: half of available CPUs)
  • Show progress as jobs complete
  • Write a job log (e.g., batch_20260122_143052.log)
  • Create JSONL output files in the output directory (named by SHA1 hash of manifest URL)

Full Workflow

Step 1: Prepare Your Environment

# Clone and install
git clone https://github.com/pulibrary/barnacle.git
cd barnacle
pdm install

# Download the OCR model
pdm run kraken get 10.5281/zenodo.10592716

# Create output directory
mkdir -p output

Step 2: Generate Manifest List

# From CSV file
python scripts/prepare_manifests.py data/lapidus_lar.csv -o manifests.txt

# Check the generated list
head manifests.txt
wc -l manifests.txt

Step 3: Test with Small Batch

Before processing the full collection, test with a few manifests:

# Create test subset
head -5 manifests.txt > test_manifests.txt

# Run test batch
./scripts/batch_process.sh \
    --manifest-list test_manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 2

# Verify outputs
ls -la output/*.jsonl

Step 4: Run Full Batch

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 8

Running Long Jobs

For jobs that may run for hours or days, use one of these approaches to prevent disconnection issues.

Option 1: tmux (Recommended)

# Start in tmux session
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --tmux

# Detach from session: Ctrl+B, then D
# Reattach later:
tmux attach -t barnacle

Option 2: Manual tmux

# Start new tmux session
tmux new -s barnacle

# Run the batch process
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Detach: Ctrl+B, then D
# Reattach: tmux attach -t barnacle

Option 3: screen

# Start screen session
screen -S barnacle

# Run the batch process
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716

# Detach: Ctrl+A, then D
# Reattach: screen -r barnacle

Option 4: nohup

# Run in background with output logging
nohup ./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    > batch_output.log 2>&1 &

# Check progress
tail -f batch_output.log

Monitoring Progress

During Processing

GNU Parallel displays progress by default:

Computers / CPU cores / Max jobs to run
1:local / 8 / 4

Computer:jobs running/jobs completed/%of started jobs/ETA:
1:4/23/65%/0:05:32

Check Job Log

The job log tracks every job's status:

# View job log
cat batch_20260122_143052.log

# Count successfully completed jobs (exit code 0 in column 7; skip the header)
awk 'NR > 1 && $7 == 0' batch_20260122_143052.log | wc -l

# Find failed jobs (skip the header row)
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log

Job log columns:

  1. Sequence number
  2. Host
  3. Start time (epoch)
  4. Run time (seconds)
  5. Transfer time
  6. Bytes transferred
  7. Exit code (0 = success)
  8. Signal
  9. Command
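
Those columns make the log easy to summarize with awk. A sketch against a fabricated example log (the field layout is GNU Parallel's; the sample rows and the command names in them are made up):

```shell
# Build a tiny joblog-shaped file (tab-separated, header + 3 jobs, one of
# which failed with exit code 1) purely for illustration.
{
    printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n'
    printf '1\t:\t1700000000\t12.3\t0\t0\t0\t0\tjob-for urlA\n'
    printf '2\t:\t1700000001\t9.8\t0\t0\t1\t0\tjob-for urlB\n'
    printf '3\t:\t1700000002\t15.1\t0\t0\t0\t0\tjob-for urlC\n'
} > demo.log

# Summarize totals, successes (exit code 0 in column 7), and failures.
awk -F'\t' 'NR > 1 { total++; if ($7 == 0) ok++; else bad++ }
            END { printf "total=%d ok=%d failed=%d\n", total, ok, bad }' demo.log
# prints: total=3 ok=2 failed=1
```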

Count Outputs

# Count completed JSONL files
ls output/*.jsonl | wc -l

# Check total lines (pages processed)
wc -l output/*.jsonl

Resuming Interrupted Batches

If processing is interrupted (Ctrl+C, system reboot, etc.), resume with:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --resume \
    --joblog batch_20260122_143052.log

The --resume flag tells GNU Parallel to:

  • Read the existing job log
  • Skip successfully completed jobs
  • Retry failed jobs
  • Continue with unprocessed jobs

Note: Barnacle's --resume flag (enabled by default) also handles page-level resume within a manifest, so interrupted manifests will resume from the last completed page.
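
Because outputs are named by the SHA1 hash of the manifest URL, you can also list which manifests still lack output before resuming. A sketch (assuming the hash covers the exact URL string; the demo list and directory are illustrative):

```shell
# Print manifest URLs that do not yet have a <sha1-of-url>.jsonl file in
# the output directory (naming assumption; verify against your outputs).
pending_manifests() {
    list=$1; outdir=$2
    while IFS= read -r url; do
        name=$(printf '%s' "$url" | sha1sum | awk '{print $1}')
        [ -f "$outdir/$name.jsonl" ] || printf '%s\n' "$url"
    done < "$list"
}

# Demo: mark one of two manifests as already processed.
mkdir -p demo_out
printf 'https://example.org/a\nhttps://example.org/b\n' > demo_list.txt
touch "demo_out/$(printf '%s' 'https://example.org/a' | sha1sum | awk '{print $1}').jsonl"
pending_manifests demo_list.txt demo_out   # prints: https://example.org/b
```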

Troubleshooting

"parallel: command not found"

GNU Parallel is not installed:

sudo apt install parallel

"barnacle: command not found"

Barnacle is not in PATH. Either:

# Activate virtual environment
source .venv/bin/activate
./scripts/batch_process.sh ...

# Or use pdm run
pdm run ./scripts/batch_process.sh ...

Jobs Failing with Memory Errors

Reduce parallelism to lower memory pressure:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 2

Jobs Failing with Network Errors

IIIF servers may rate-limit requests. Reduce parallelism:

./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --jobs 4

Checking Failed Jobs

# Find failed jobs in the log (skip the header row)
awk 'NR > 1 && $7 != 0' batch_20260122_143052.log

# Get the commands (including manifest URLs) that failed
awk -F'\t' 'NR > 1 && $7 != 0 {print $9}' batch_20260122_143052.log

# Resume just the failed jobs
./scripts/batch_process.sh \
    --manifest-list manifests.txt \
    --output-dir ./output \
    --model 10.5281/zenodo.10592716 \
    --resume \
    --joblog batch_20260122_143052.log

tmux Session Already Exists

# Attach to existing session
tmux attach -t barnacle

# Or kill it and start fresh
tmux kill-session -t barnacle

Command Reference

./scripts/batch_process.sh --help

Usage:
  ./scripts/batch_process.sh --manifest-list <FILE> --output-dir <DIR> --model <MODEL> [OPTIONS]

Required Arguments:
  --manifest-list <FILE>   Path to manifest list file (one URL per line)
  --output-dir <DIR>       Directory for output JSONL files
  --model <MODEL>          Kraken model reference (DOI or path)

Options:
  --jobs <N>               Number of parallel workers (default: nproc/2)
  --joblog <FILE>          Path to job log file (default: batch_YYYYMMDD_HHMMSS.log)
  --resume                 Resume from previous joblog (use with --joblog)
  --tmux                   Start processing in a new tmux session
  -h, --help               Show this help message

Comparison with SLURM

| Feature         | GNU Parallel             | SLURM                     |
| --------------- | ------------------------ | ------------------------- |
| Environment     | Any Linux/macOS          | HPC cluster with SLURM    |
| Setup           | apt install parallel     | Cluster configuration     |
| Parallelism     | Single machine           | Multi-node cluster        |
| Job management  | Job log file             | SLURM job arrays          |
| Resume          | --resume-failed          | Resubmit failed tasks     |
| Resource limits | Manual (--jobs N)        | SLURM resource allocation |
| Monitoring      | Progress bar, job log    | squeue, sacct             |
| Best for        | Small/medium batches, VMs | Large collections, HPC   |

When to Use GNU Parallel

  • Processing on a single VM or workstation
  • Small to medium collections (hundreds of manifests)
  • Environments without SLURM
  • Quick testing before HPC deployment

When to Use SLURM

  • Large collections (thousands of manifests)
  • Access to HPC cluster resources
  • Need for multi-node parallelism
  • Production workflows with monitoring

See Also