PPARSER CLI Usage Guide - Complete System Integration

Overview

PPARSER converts PDF files to Markdown through two command-line interfaces: the original CLI and an enhanced CLI. The enhanced CLI is built on the new architecture, which improves agent management, error handling, and processing.

Available Interfaces

- Original CLI (pparser.cli): Stable
- Enhanced CLI (pparser.enhanced_cli): Advanced interface with new architecture features

Architecture Enhancements

The enhanced system includes:

- AgentFactory: Centralized agent creation and pipeline management
- ConfigManager: Advanced configuration handling
- ErrorHandler: Comprehensive error handling with retry logic
- MemorySystem: Advanced memory management for agents
- ContentUtils: Specialized content processing utilities
- Enhanced Processors: Improved single-file and batch processing

Installation and Setup

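First, install PPARSER in editable mode from the root of your checkout (the path below is an example; substitute the location of your own clone):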

cd /home/lexo/dev/PPARSER
pip install -e .

Standard CLI Usage

Basic Commands

1. Process a Single PDF File

# Basic processing
python -m pparser single input.pdf -o output/

# With verbose logging
python -m pparser single input.pdf -o output/ --verbose

# Without quality validation (faster)
python -m pparser single input.pdf -o output/ --no-quality-check

# Custom configuration
python -m pparser single input.pdf -o output/ --config config.json

2. Batch Process Multiple PDFs

# Process all PDFs in a directory
python -m pparser batch input_directory/ -o output_directory/

# With custom workers and pattern
python -m pparser batch input_directory/ -o output_directory/ --workers 8 --pattern "*.pdf"

# Recursive processing with retry
python -m pparser batch input_directory/ -o output_directory/ --workers 4 --retry 2

# Skip subdirectories
python -m pparser batch input_directory/ -o output_directory/ --no-recursive
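
To preview which files a batch run might pick up, the combination of --pattern and --no-recursive can be approximated in a few lines of Python. This is a minimal sketch, not PPARSER's own discovery code: the function name find_pdfs and its parameters are hypothetical, and the real CLI may match files differently.

```python
from pathlib import Path


def find_pdfs(input_dir, pattern="*.pdf", recursive=True):
    """Return sorted file paths under input_dir whose names match pattern.

    Hypothetical helper mirroring the --pattern and --no-recursive options.
    """
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

Calling find_pdfs("input_directory/", recursive=False) lists only top-level matches, which is a quick way to check a pattern before starting a long run.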

3. System Status and Information

# Check system status
python -m pparser status

# Check workflow visualization
python -m pparser workflow

Enhanced CLI Usage

The enhanced CLI provides advanced features and improved architecture:

Key Enhancements

- Pipeline Selection: Choose optimized processing strategies
- Enhanced Error Handling: Automatic retry with intelligent backoff
- Agent Management: Monitor and control individual agents
- Quality Validation: Multi-dimensional quality scoring
- Advanced Reporting: Comprehensive processing statistics

Enhanced CLI Commands

1. Single File Processing with Pipelines

# Academic pipeline for research papers
python pparser/enhanced_cli.py single paper.pdf -o output/ --pipeline academic

# Technical pipeline for manuals
python pparser/enhanced_cli.py single manual.pdf -o output/ --pipeline technical

# Fast pipeline for quick conversion
python pparser/enhanced_cli.py single document.pdf -o output/ --pipeline fast

# With comprehensive validation
python pparser/enhanced_cli.py single file.pdf -o output/ --validate --enhance --agent-memory

2. Advanced Batch Processing

# Batch with retry and quality validation
python pparser/enhanced_cli.py batch input/ -o output/ --workers 8 --retry 2

# Academic pipeline for research papers
python pparser/enhanced_cli.py batch papers/ -o output/ --pipeline academic --validate

# High-performance processing
python pparser/enhanced_cli.py batch input/ -o output/ --workers 16 --pipeline fast --no-quality-check

3. System Management and Monitoring

# System status with detailed information
python pparser/enhanced_cli.py status --detailed

# Agent management
python pparser/enhanced_cli.py agents list
python pparser/enhanced_cli.py agents test
python pparser/enhanced_cli.py agents inspect formula

# Configuration management
python pparser/enhanced_cli.py configure show
python pparser/enhanced_cli.py configure set temperature 0.5

Pipeline Types

Standard Pipeline (--pipeline standard)
- Balanced speed and quality
- Good for general documents
- Default option

Academic Pipeline (--pipeline academic)
- Optimized for research papers
- Enhanced formula and table processing
- Higher quality validation

Technical Pipeline (--pipeline technical)
- Optimized for technical documentation
- Enhanced structure detection
- Comprehensive asset management

Fast Pipeline (--pipeline fast)
- Prioritizes speed over quality
- Reduced validation steps
- Quick turnaround

Standard CLI Options

| Option | Description | Default |
| --- | --- | --- |
| `--verbose`, `-v` | Enable verbose logging | False |
| `--config` | Path to custom config file | None |
| `--log-file` | Log file path | Console only |
| `--no-quality-check` | Disable quality validation | False |
| `--no-metadata` | Don't include metadata | False |
| `--workers` | Number of concurrent workers | 4 |
| `--pattern` | File pattern to match | "*.pdf" |
| `--retry` | Number of retry attempts | 0 |
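
When driving the standard CLI from a script, the options in the table can be assembled into an argument list. The sketch below is illustrative only: build_batch_command is a hypothetical helper, and it uses just the flags and defaults listed above.

```python
import sys


def build_batch_command(input_dir, output_dir, workers=4, pattern="*.pdf",
                        retry=0, verbose=False, quality_check=True):
    """Assemble an argv list for `python -m pparser batch` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "pparser", "batch", str(input_dir),
        "-o", str(output_dir),
        "--workers", str(workers),
        "--pattern", pattern,
        "--retry", str(retry),
    ]
    if verbose:
        cmd.append("--verbose")
    if not quality_check:
        cmd.append("--no-quality-check")
    return cmd
```

Passing the resulting list to subprocess.run(cmd, check=True) avoids shell quoting problems with the "*.pdf" pattern, because no shell expands the wildcard before PPARSER sees it.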

Enhanced CLI Usage

The enhanced CLI can also be run as a module with python -m pparser.enhanced_cli. The commands below use that form:

Basic Commands

1. Single File Processing with Pipelines

# Standard pipeline
python -m pparser.enhanced_cli single input.pdf -o output/

# Academic pipeline (optimized for research papers)
python -m pparser.enhanced_cli single research_paper.pdf -o output/ --pipeline academic

# Technical pipeline (optimized for technical documents)
python -m pparser.enhanced_cli single manual.pdf -o output/ --pipeline technical

# Fast pipeline (speed over quality)
python -m pparser.enhanced_cli single document.pdf -o output/ --pipeline fast

2. Enhanced Batch Processing

# Basic batch with enhanced features
python -m pparser.enhanced_cli batch input/ -o output/

# With comprehensive validation and enhancement
python -m pparser.enhanced_cli batch input/ -o output/ --validate --enhance

# Academic pipeline for research papers
python -m pparser.enhanced_cli batch papers/ -o output/ --pipeline academic --workers 6

3. System Management

# Check system status
python -m pparser.enhanced_cli status

# Detailed system information
python -m pparser.enhanced_cli status --detailed

# Test agent pipeline
python -m pparser.enhanced_cli test-pipeline --pipeline academic

# Configuration management
python -m pparser.enhanced_cli config show
python -m pparser.enhanced_cli config set --key max_workers --value 8

Enhanced CLI Features

| Feature | Description |
| --- | --- |
| Pipeline Selection | Choose optimized processing pipelines |
| Quality Validation | Enhanced content validation and scoring |
| Retry Logic | Intelligent retry with exponential backoff |
| Agent Management | Monitor and control individual agents |
| Configuration Management | Dynamic configuration updates |
| Performance Metrics | Detailed processing statistics |
| Error Recovery | Advanced error handling and recovery |
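
The "retry with exponential backoff" idea in the table can be illustrated generically. This is a sketch of the technique, not PPARSER's ErrorHandler: the function name, its parameters, and the delay values are assumptions chosen for the example.

```python
import time


def retry_with_backoff(func, retries=2, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call func(); after each failure wait base_delay * factor**attempt.

    Hypothetical helper: with the defaults the waits are 1s, then 2s.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)
```

The sleep parameter is injectable so the delays can be inspected or skipped in tests. Here retries plays the same role as the --retry option: the number of extra attempts after the first failure.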

Pipeline Types

Standard Pipeline (--pipeline standard)
- Balanced speed and quality
- Good for general documents
- Default option

Academic Pipeline (--pipeline academic)
- Optimized for research papers
- Enhanced formula and citation handling
- Better table and figure processing

Technical Pipeline (--pipeline technical)
- Optimized for technical documentation
- Enhanced code block detection
- Better diagram and flowchart handling

Fast Pipeline (--pipeline fast)
- Prioritizes speed over quality
- Basic validation only
- Good for bulk processing
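
A --pipeline option with these four values and a standard default could be declared with argparse as shown below. This is only a sketch of how such an option behaves; PPARSER's actual parser may be defined differently, and the names PIPELINES and make_parser are hypothetical.

```python
import argparse

# The four pipeline names documented above.
PIPELINES = ("standard", "academic", "technical", "fast")


def make_parser():
    """Build a tiny parser that accepts only the documented pipeline names."""
    parser = argparse.ArgumentParser(prog="pipeline-demo")
    parser.add_argument("input", help="PDF file to process")
    parser.add_argument("--pipeline", choices=PIPELINES, default="standard",
                        help="processing pipeline to use")
    return parser
```

With choices set, argparse rejects any other value with a usage error instead of passing an unknown pipeline name further down.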

Examples

Example 1: Process Research Papers

# Process academic papers with enhanced quality
python -m pparser.enhanced_cli batch research_papers/ -o output/ \
  --pipeline academic \
  --validate \
  --enhance \
  --workers 4 \
  --retry 1

Example 2: Quick Bulk Processing

# Fast processing of many documents
python -m pparser.enhanced_cli batch documents/ -o output/ \
  --pipeline fast \
  --workers 8 \
  --no-quality-check

Example 3: High-Quality Single Document

# Maximum quality processing
python -m pparser.enhanced_cli single important_document.pdf -o output/ \
  --pipeline academic \
  --validate \
  --enhance \
  --verbose

Output Structure

Both CLIs produce the same output structure:

output/
├── document_name.md           # Main markdown file
├── images/                    # Extracted images
│   ├── page_1_img_1.png
│   └── page_2_img_1.png
├── tables/                    # Extracted tables
│   ├── page_3_table_1.csv
│   └── page_4_table_1.csv
└── metadata.json             # Processing metadata
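
After a run, the layout above can be checked with a short script. The sketch below assumes exactly the directory and file names shown in the tree; summarize_output is a hypothetical helper, and the contents of metadata.json are returned as-is because their schema is not described here.

```python
import json
from pathlib import Path


def _count(directory, pattern):
    """Count matching entries, treating a missing directory as empty."""
    return len(list(directory.glob(pattern))) if directory.is_dir() else 0


def summarize_output(output_dir):
    """Summarize an output directory laid out as in the tree above."""
    out = Path(output_dir)
    meta_path = out / "metadata.json"
    metadata = None
    if meta_path.is_file():
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    return {
        "markdown": sorted(p.name for p in out.glob("*.md")),
        "images": _count(out / "images", "*"),
        "tables": _count(out / "tables", "*.csv"),
        "metadata": metadata,
    }
```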

Configuration Files

Create a config.json file for custom settings and pass it to the CLI with --config:

{
  "max_workers": 6,
  "quality_threshold": 0.8,
  "enable_ocr": true,
  "output_format": "markdown",
  "image_extraction": {
    "enabled": true,
    "min_size": 100,
    "formats": ["png", "jpg"]
  },
  "table_extraction": {
    "enabled": true,
    "detect_headers": true,
    "output_format": "csv"
  }
}
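
Before handing a config file to the CLI, it can help to confirm that it parses and that obvious values are sane. The sketch below is not PPARSER's ConfigManager: load_config is a hypothetical helper, and the rule that quality_threshold lies between 0 and 1 is an assumption based on the 0.8 in the example.

```python
import json


def load_config(path):
    """Load a config file like the one above and sanity-check two fields."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    problems = []
    workers = config.get("max_workers")
    if workers is not None and (type(workers) is not int or workers < 1):
        problems.append("max_workers must be a positive integer")
    # Assumption: quality_threshold is a fraction in [0, 1], as 0.8 suggests.
    threshold = config.get("quality_threshold")
    if threshold is not None and not (
        isinstance(threshold, (int, float)) and 0.0 <= threshold <= 1.0
    ):
        problems.append("quality_threshold must be a number between 0 and 1")
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

A malformed file raises json.JSONDecodeError with the line and column of the error, which is usually easier to act on than a failure partway through a batch.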

Troubleshooting

Common Issues

Import Errors

# Ensure package is installed
pip install -e .

# Check Python path
python -c "import pparser; print(pparser.__file__)"
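
The same check can be made without importing the package, which is useful when the import itself is what fails. This sketch uses the standard library's importlib.util.find_spec; the helper name module_location is hypothetical.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would load from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

If module_location("pparser") returns None, the package is not visible to the interpreter you are running, which usually means it was installed into a different environment.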

Memory Issues

# Reduce workers
python -m pparser batch input/ -o output/ --workers 2

# Use fast pipeline
python -m pparser.enhanced_cli batch input/ -o output/ --pipeline fast

Quality Issues

# Use academic pipeline for better quality
python -m pparser.enhanced_cli single document.pdf -o output/ --pipeline academic --validate

Getting Help

# General help
python -m pparser --help

# Command-specific help
python -m pparser single --help
python -m pparser batch --help

# Enhanced CLI help
python -m pparser.enhanced_cli --help
python -m pparser.enhanced_cli single --help

Performance Tips

- Adjust Workers: Use 1-2 workers per CPU core
- Use Fast Pipeline: For bulk processing where quality is less critical
- Disable Validation: Omit --validate when speed matters more than quality checks
- Custom Configuration: Tune settings for your specific use case
- Monitor Resources: Use --verbose to track performance
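
The "1-2 workers per CPU core" rule of thumb can be turned into a starting value for --workers. This is a rough sketch: suggest_workers and its parameters are hypothetical, and the right number also depends on memory and on how much time each document spends waiting on I/O.

```python
import os


def suggest_workers(per_core=1, cap=None):
    """Suggest a --workers value from the CPU count (hypothetical helper)."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    workers = max(1, cores * per_core)
    return min(workers, cap) if cap else workers
```

If memory is the bottleneck, pass a cap, for example suggest_workers(per_core=2, cap=8).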

Next Steps

- Try processing a sample PDF with both CLIs
- Compare output quality between pipelines
- Customize configuration for your needs
- Set up batch processing workflows
- Integrate with your existing tools
