PPARSER converts PDF documents to Markdown through two command-line interfaces: the original CLI and an enhanced CLI built on a new architecture with improved agent management, error handling, and processing capabilities.
- Original CLI (`pparser.cli`) - Stable
- Enhanced CLI (`pparser.enhanced_cli`) - Advanced interface with new architecture features
The enhanced system includes:
- AgentFactory: Centralized agent creation and pipeline management
- ConfigManager: Advanced configuration handling
- ErrorHandler: Comprehensive error handling with retry logic
- MemorySystem: Advanced memory management for agents
- ContentUtils: Specialized content processing utilities
- Enhanced Processors: Improved single-file and batch processing
First, ensure PPARSER is installed:

```bash
cd /home/lexo/dev/PPARSER
pip install -e .
```

```bash
# Basic processing
python -m pparser single input.pdf -o output/

# With verbose logging
python -m pparser single input.pdf -o output/ --verbose

# Without quality validation (faster)
python -m pparser single input.pdf -o output/ --no-quality-check

# Custom configuration
python -m pparser single input.pdf -o output/ --config config.json
```

```bash
# Process all PDFs in a directory
python -m pparser batch input_directory/ -o output_directory/

# With custom workers and pattern
python -m pparser batch input_directory/ -o output_directory/ --workers 8 --pattern "*.pdf"

# Recursive processing with retry
python -m pparser batch input_directory/ -o output_directory/ --workers 4 --retry 2

# Skip subdirectories
python -m pparser batch input_directory/ -o output_directory/ --no-recursive
```
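The way `--pattern` and `--no-recursive` narrow a batch run can be pictured with a short sketch. It is not taken from PPARSER's source: `find_pdfs` is a hypothetical helper, and it assumes the pattern is a shell-style glob matched against file names, with subdirectories searched unless recursion is switched off.

```python
from pathlib import Path


def find_pdfs(input_dir, pattern="*.pdf", recursive=True):
    """Return sorted files under input_dir whose names match a glob pattern.

    Hypothetical helper: it mirrors the documented --pattern and
    --no-recursive options but is not PPARSER's own implementation.
    """
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

Calling `find_pdfs("input_directory", recursive=False)` would list only the top-level PDFs, which is the set a `--no-recursive` run is expected to process under these assumptions.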
#### 4. System Status and Information
```bash
# Check system status
python -m pparser status

# Check workflow visualization
python -m pparser workflow
```

The enhanced CLI provides advanced features and improved architecture:
- Pipeline Selection: Choose optimized processing strategies
- Enhanced Error Handling: Automatic retry with intelligent backoff
- Agent Management: Monitor and control individual agents
- Quality Validation: Multi-dimensional quality scoring
- Advanced Reporting: Comprehensive processing statistics
```bash
# Academic pipeline for research papers
python pparser/enhanced_cli.py single paper.pdf -o output/ --pipeline academic

# Technical pipeline for manuals
python pparser/enhanced_cli.py single manual.pdf -o output/ --pipeline technical

# Fast pipeline for quick conversion
python pparser/enhanced_cli.py single document.pdf -o output/ --pipeline fast

# With comprehensive validation
python pparser/enhanced_cli.py single file.pdf -o output/ --validate --enhance --agent-memory
```

```bash
# Batch with retry and quality validation
python pparser/enhanced_cli.py batch input/ -o output/ --workers 8 --retry 2

# Academic pipeline for research papers
python pparser/enhanced_cli.py batch papers/ -o output/ --pipeline academic --validate

# High-performance processing
python pparser/enhanced_cli.py batch input/ -o output/ --workers 16 --pipeline fast --no-quality-check
```

```bash
# System status with detailed information
python pparser/enhanced_cli.py status --detailed

# Agent management
python pparser/enhanced_cli.py agents list
python pparser/enhanced_cli.py agents test
python pparser/enhanced_cli.py agents inspect formula

# Configuration management
python pparser/enhanced_cli.py configure show
python pparser/enhanced_cli.py configure set temperature 0.5
```
- Standard Pipeline (`--pipeline standard`)
  - Balanced speed and quality
  - Good for general documents
  - Default option
- Academic Pipeline (`--pipeline academic`)
  - Optimized for research papers
  - Enhanced formula and table processing
  - Higher quality validation
- Technical Pipeline (`--pipeline technical`)
  - Optimized for technical documentation
  - Enhanced structure detection
  - Comprehensive asset management
- Fast Pipeline (`--pipeline fast`)
  - Prioritizes speed over quality
  - Reduced validation steps
  - Quick turnaround
| Option | Description | Default |
|---|---|---|
| `--verbose`, `-v` | Enable verbose logging | False |
| `--config` | Path to custom config file | None |
| `--log-file` | Log file path | Console only |
| `--no-quality-check` | Disable quality validation | False |
| `--no-metadata` | Don't include metadata | False |
| `--workers` | Number of concurrent workers | 4 |
| `--pattern` | File pattern to match | `"*.pdf"` |
| `--retry` | Number of retry attempts | 0 |
The enhanced CLI provides advanced features and better architecture:
```bash
# Standard pipeline
python -m pparser.enhanced_cli single input.pdf -o output/

# Academic pipeline (optimized for research papers)
python -m pparser.enhanced_cli single research_paper.pdf -o output/ --pipeline academic

# Technical pipeline (optimized for technical documents)
python -m pparser.enhanced_cli single manual.pdf -o output/ --pipeline technical

# Fast pipeline (speed over quality)
python -m pparser.enhanced_cli single document.pdf -o output/ --pipeline fast
```

```bash
# Basic batch with enhanced features
python -m pparser.enhanced_cli batch input/ -o output/

# With comprehensive validation and enhancement
python -m pparser.enhanced_cli batch input/ -o output/ --validate --enhance

# Academic pipeline for research papers
python -m pparser.enhanced_cli batch papers/ -o output/ --pipeline academic --workers 6
```

```bash
# Check system status
python -m pparser.enhanced_cli status

# Detailed system information
python -m pparser.enhanced_cli status --detailed

# Test agent pipeline
python -m pparser.enhanced_cli test-pipeline --pipeline academic

# Configuration management
python -m pparser.enhanced_cli config show
python -m pparser.enhanced_cli config set --key max_workers --value 8
```

| Feature | Description |
|---|---|
| Pipeline Selection | Choose optimized processing pipelines |
| Quality Validation | Enhanced content validation and scoring |
| Retry Logic | Intelligent retry with exponential backoff |
| Agent Management | Monitor and control individual agents |
| Configuration Management | Dynamic configuration updates |
| Performance Metrics | Detailed processing statistics |
| Error Recovery | Advanced error handling and recovery |
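The "retry with exponential backoff" behaviour listed in the table can be illustrated with a generic sketch. This shows the general technique only and makes no claim about how PPARSER's ErrorHandler is written; `retry_with_backoff` and its parameters are hypothetical names chosen for the example.

```python
import time


def retry_with_backoff(func, retries=2, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call func(); on failure, wait and retry up to `retries` more times.

    The wait grows geometrically: base_delay, base_delay * factor, and so on.
    Illustrative only; not PPARSER's ErrorHandler.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * factor ** attempt)
            attempt += 1
```

With `retries=2` and `base_delay=1.0`, a call that fails twice and then succeeds waits 1 second and then 2 seconds before returning. The `sleep` argument is injectable so the waits can be replaced with a recorder during testing.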
- Standard Pipeline (`--pipeline standard`)
  - Balanced speed and quality
  - Good for general documents
  - Default option
- Academic Pipeline (`--pipeline academic`)
  - Optimized for research papers
  - Enhanced formula and citation handling
  - Better table and figure processing
- Technical Pipeline (`--pipeline technical`)
  - Optimized for technical documentation
  - Enhanced code block detection
  - Better diagram and flowchart handling
- Fast Pipeline (`--pipeline fast`)
  - Prioritizes speed over quality
  - Basic validation only
  - Good for bulk processing
```bash
# Process academic papers with enhanced quality
python -m pparser.enhanced_cli batch research_papers/ -o output/ \
  --pipeline academic \
  --validate \
  --enhance \
  --workers 4 \
  --retry 1
```

```bash
# Fast processing of many documents
python -m pparser.enhanced_cli batch documents/ -o output/ \
  --pipeline fast \
  --workers 8 \
  --no-validation
```

```bash
# Maximum quality processing
python -m pparser.enhanced_cli single important_document.pdf -o output/ \
  --pipeline academic \
  --validate \
  --enhance \
  --verbose
```

Both CLIs produce the same output structure:
```
output/
├── document_name.md          # Main markdown file
├── images/                   # Extracted images
│   ├── page_1_img_1.png
│   └── page_2_img_1.png
├── tables/                   # Extracted tables
│   ├── page_3_table_1.csv
│   └── page_4_table_1.csv
└── metadata.json             # Processing metadata
```
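A downstream script can walk this layout to check what a run produced. The sketch below assumes only the file and directory names shown in the tree; `summarize_output` is a hypothetical helper, and because the contents of `metadata.json` are not described here, the file is loaded as plain JSON without interpreting any keys.

```python
import json
from pathlib import Path


def _names(directory, pattern):
    """Sorted file names matching pattern, or [] if the directory is missing."""
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob(pattern) if p.is_file())


def summarize_output(output_dir):
    """Collect the Markdown files, images, tables and metadata in output_dir.

    Hypothetical helper based only on the directory layout shown above.
    """
    out = Path(output_dir)
    metadata_path = out / "metadata.json"
    metadata = None
    if metadata_path.is_file():
        # The metadata schema is not documented here, so it is returned as-is.
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return {
        "markdown": _names(out, "*.md"),
        "images": _names(out / "images", "*"),
        "tables": _names(out / "tables", "*.csv"),
        "metadata": metadata,
    }
```

Comparing the `images` and `tables` lists from two runs is a quick way to see whether different pipelines extracted the same assets.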
Create a `config.json` file for custom settings:

```json
{
  "max_workers": 6,
  "quality_threshold": 0.8,
  "enable_ocr": true,
  "output_format": "markdown",
  "image_extraction": {
    "enabled": true,
    "min_size": 100,
    "formats": ["png", "jpg"]
  },
  "table_extraction": {
    "enabled": true,
    "detect_headers": true,
    "output_format": "csv"
  }
}
```
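Before handing a custom file to `--config`, it can help to check it from Python. The sketch below overlays a user file on the example values above and applies two sanity checks. It rests on assumptions: PPARSER's real defaults and validation rules are not stated here, the 0 to 1 range for `quality_threshold` is inferred from the example value, and `DEFAULTS`, `merge`, and `load_config` are hypothetical names.

```python
import copy
import json
from pathlib import Path

# Values copied from the example config.json; treating them as defaults
# is an assumption made for this sketch.
DEFAULTS = {
    "max_workers": 6,
    "quality_threshold": 0.8,
    "enable_ocr": True,
    "output_format": "markdown",
    "image_extraction": {"enabled": True, "min_size": 100, "formats": ["png", "jpg"]},
    "table_extraction": {"enabled": True, "detect_headers": True, "output_format": "csv"},
}


def merge(defaults, overrides):
    """Recursively overlay overrides on a deep copy of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path):
    """Load a JSON config, fill in missing keys, and run basic checks."""
    user = json.loads(Path(path).read_text(encoding="utf-8"))
    config = merge(DEFAULTS, user)
    if config["max_workers"] < 1:
        raise ValueError("max_workers must be at least 1")
    # Assumed range: the example uses 0.8, which suggests a 0-1 score.
    if not 0.0 <= config["quality_threshold"] <= 1.0:
        raise ValueError("quality_threshold must be between 0 and 1")
    return config
```

A file containing only `{"max_workers": 2}` would then yield a full configuration with every other key taken from the example values.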
- Import Errors

  ```bash
  # Ensure package is installed
  pip install -e .

  # Check Python path
  python -c "import pparser; print(pparser.__file__)"
  ```

- Memory Issues

  ```bash
  # Reduce workers
  python -m pparser batch input/ -o output/ --workers 2

  # Use fast pipeline
  python -m pparser.enhanced_cli batch input/ -o output/ --pipeline fast
  ```

- Quality Issues

  ```bash
  # Use academic pipeline for better quality
  python -m pparser.enhanced_cli single document.pdf -o output/ --pipeline academic --validate
  ```
```bash
# General help
python -m pparser --help

# Command-specific help
python -m pparser single --help
python -m pparser batch --help

# Enhanced CLI help
python -m pparser.enhanced_cli --help
python -m pparser.enhanced_cli single --help
```

- Adjust Workers: Use 1-2 workers per CPU core
- Use Fast Pipeline: For bulk processing where quality is less critical
- Disable Validation: Skip `--validate` for faster processing
- Custom Configuration: Tune settings for your specific use case
- Monitor Resources: Use `--verbose` to track performance
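The "1-2 workers per CPU core" guideline can be turned into a concrete `--workers` value with the standard library. `suggest_workers` is a hypothetical helper written for this example; it applies the rule of thumb literally and does not account for memory limits.

```python
import os


def suggest_workers(per_core=1, cpu_count=None):
    """Suggest a --workers value from the 1-2 workers-per-core rule of thumb."""
    if per_core not in (1, 2):
        raise ValueError("per_core should be 1 or 2")
    # os.cpu_count() may return None on some platforms; fall back to 1.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores * per_core)
```

For example, `suggest_workers(per_core=2, cpu_count=4)` returns 8. If memory becomes the bottleneck, a lower value such as the `--workers 2` shown under Memory Issues is the safer choice.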
- Try processing a sample PDF with both CLIs
- Compare output quality between pipelines
- Customize configuration for your needs
- Set up batch processing workflows
- Integrate with your existing tools