Feature: Automatic Document Language Detection and Locale Parametrization #11
Description
Summary
Implement automatic language detection for uploaded PDF documents and dynamically apply the appropriate locale settings to Adobe PDF Services API calls for optimal processing results across multiple languages.
🎯 Motivation
Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (en-US) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction capabilities for non-English documents, particularly for languages like Spanish, Catalan, French, German, and others that have specific linguistic rules and accessibility requirements.
✨ Features Implemented
1. Automatic Language Detection
- AWS Comprehend Integration: Utilizes AWS Comprehend's DetectDominantLanguage API to analyze document content
- Smart Text Sampling: Extracts text from the first 5 pages of the PDF for language analysis
- Confidence Thresholding: Only applies detected language if confidence score ≥ 70%
- Graceful Fallbacks: Defaults to English (en-US) for low-confidence detections or errors
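The thresholding and fallback rules above could be sketched roughly as follows. This is a minimal illustration, not the code from this change: the function name `detect_language_code` and its `threshold` parameter are hypothetical, and the client is passed in so the logic can be exercised without AWS credentials. The only AWS call used is `detect_dominant_language(Text=...)`, whose response carries a `Languages` list of `LanguageCode`/`Score` pairs.

```python
def detect_language_code(text, comprehend_client, threshold=0.70):
    """Return the dominant language code, or None when detection is unreliable.

    `comprehend_client` is expected to behave like boto3.client("comprehend").
    A None result means the caller should fall back to the default locale.
    """
    try:
        response = comprehend_client.detect_dominant_language(Text=text)
    except Exception:
        # API errors, throttling, and network failures all lead to the fallback.
        return None

    languages = response.get("Languages") or []
    if not languages:
        return None

    # Pick the candidate with the highest confidence score.
    best = max(languages, key=lambda item: item["Score"])
    if best["Score"] < threshold:
        return None
    return best["LanguageCode"]
```

In the pipeline this would presumably be called with `boto3.client("comprehend")` and the sampled text; a `None` result corresponds to the en-US default described above.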
2. Comprehensive Language Support
Supports 30+ languages with proper locale mapping:
| Language | AWS Code | Adobe Locale | Region |
|---|---|---|---|
| English | en | en-US | United States |
| Spanish | es | es-ES | Spain |
| Catalan | ca | ca-ES | Spain |
| French | fr | fr-FR | France |
| German | de | de-DE | Germany |
| Italian | it | it-IT | Italy |
| Portuguese | pt | pt-BR | Brazil |
| Japanese | ja | ja-JP | Japan |
| Chinese | zh | zh-CN | China (Simplified) |
| And 20+ more... | | | |
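As an illustration of how such a mapping might be expressed, the sketch below covers only the nine rows listed in the table; the remaining entries are not reproduced here. The names `LANGUAGE_TO_LOCALE`, `to_adobe_locale`, and `language_from_locale` are hypothetical and may not match the identifiers used in autotag.py.

```python
DEFAULT_LOCALE = "en-US"

# Only the rows shown in the table above; the full mapping has more entries.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}


def to_adobe_locale(language_code):
    """Map a Comprehend language code to a locale, defaulting to en-US."""
    if not language_code:
        return DEFAULT_LOCALE
    return LANGUAGE_TO_LOCALE.get(language_code.lower(), DEFAULT_LOCALE)


def language_from_locale(locale):
    """Recover the bare language code (e.g. 'es') for PDF language metadata."""
    return locale.split("-", 1)[0].lower()
```

Unmapped codes fall through to the default rather than raising, which keeps the fallback behaviour in one place.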
3. Integrated Processing Pipeline
- Autotagging: Applies detected locale to AutotagPDFParams for language-aware accessibility tagging
- Text Extraction: Uses detected locale in ExtractPDFParams for improved text and table extraction
- PDF Metadata: Sets document language metadata consistently across the pipeline
4. Enhanced Error Handling & Logging
- Comprehensive logging of detection process and confidence scores
- Handles AWS Comprehend API limits (5000 bytes max text)
- Manages insufficient text scenarios gracefully
- Detailed error reporting for troubleshooting
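One way to handle the size limit and the insufficient-text case is sketched below. The 5000-byte cap is the figure quoted in this issue; the 20-character minimum and the helper name `prepare_sample` are assumptions made for illustration. Truncation is done on the UTF-8 bytes, and a multibyte character cut in half at the boundary is dropped rather than sent as invalid text.

```python
MAX_BYTES = 5000  # limit quoted in this issue
MIN_CHARS = 20    # assumed minimum amount of text worth analysing


def prepare_sample(text, max_bytes=MAX_BYTES, min_chars=MIN_CHARS):
    """Return a sample suitable for language detection, or None if too short."""
    # Collapse whitespace runs left over from PDF text extraction.
    cleaned = " ".join(text.split())
    if len(cleaned) < min_chars:
        # Typical for scanned/image-only PDFs: caller should use the default.
        return None

    encoded = cleaned.encode("utf-8")
    if len(encoded) <= max_bytes:
        return cleaned

    # Cut on the byte boundary; errors="ignore" discards a trailing
    # partial character instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```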
🔧 Technical Implementation
Core Components Added:
1. Language Detection Function
def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """

2. Updated API Functions
- autotag_pdf_with_options() - Now accepts detected_locale parameter
- extract_api() - Now accepts detected_locale parameter
- set_language_comprehend() - Enhanced to use detected locale for PDF metadata
3. Language-to-Locale Mapping
Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.
Infrastructure Changes:
AWS CDK Updates (app.py):
- IAM Permissions: Added comprehend:DetectDominantLanguage permission to ECS task role
- Environment Variables: Removed hardcoded PDF_LOCALE environment variable
- Backward Compatibility: Maintains support for manual locale override via environment variable
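The backward-compatible override could work along these lines: if PDF_LOCALE is set, it wins over the detected value; otherwise the detected locale is used. This is only a sketch of that precedence rule; the function name `resolve_locale` and the optional `env` argument (added so the logic can be tested without touching the real environment) are hypothetical.

```python
import os


def resolve_locale(detected_locale, env=None):
    """Prefer a manual PDF_LOCALE override, else use the detected locale."""
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value as "not set".
    override = env.get("PDF_LOCALE", "").strip()
    return override or detected_locale
```

With this shape, setting `PDF_LOCALE=en-US` on the task restores the previous fixed-locale behaviour without a code change.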
📊 Processing Flow
graph TD
A[PDF Upload] --> B[Download from S3]
B --> C[Extract Text from First 5 Pages]
C --> D[AWS Comprehend Language Detection]
D --> E{Confidence ≥ 70%?}
E -->|Yes| F[Map to Adobe Locale]
E -->|No| G[Default to en-US]
F --> H[Apply Locale to Adobe APIs]
G --> H
H --> I[Autotagging with Locale]
H --> J[Text Extraction with Locale]
I --> K[Set PDF Language Metadata]
J --> K
K --> L[Upload Processed PDF]
🧪 Testing Scenarios
Test Cases to Validate:
- Spanish Documents: Verify es-ES locale detection and application
- Catalan Documents: Verify ca-ES locale detection and application
- Mixed Language Documents: Test confidence thresholding
- Scanned/Image PDFs: Handle insufficient text scenarios
- Very Short Documents: Test minimum text requirements
- Error Scenarios: AWS Comprehend API failures, network issues
- Backward Compatibility: Manual locale override still works
Expected Improvements:
- Better Accessibility Tagging: Language-specific heading detection and structure analysis
- Improved Text Extraction: Better handling of language-specific characters and formatting
- Enhanced Metadata: Proper language metadata in final PDF documents
- Compliance: Better WCAG 2.1 compliance for non-English documents
📈 Benefits
For Users:
- Automatic Processing: No manual language configuration required
- Better Accuracy: Language-aware processing improves accessibility tagging quality
- Multi-language Support: Seamless handling of documents in 30+ languages
- Consistent Results: Standardized locale application across all processing steps
For Developers:
- Maintainable Code: Clean separation of language detection logic
- Extensible Design: Easy to add new language mappings
- Comprehensive Logging: Detailed insights into language detection process
- Error Resilience: Robust fallback mechanisms
🔍 Monitoring & Observability
Key Metrics to Track:
- Language detection confidence scores
- Distribution of detected languages
- Fallback to default locale frequency
- AWS Comprehend API usage and costs
- Processing time impact
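As a rough illustration of how the first three metrics might be derived from per-document detection records, consider the sketch below. The record shape (`language`, `score`, `fell_back` keys) and the function name `summarize_detections` are assumptions for this example, not something the pipeline is known to emit.

```python
from collections import Counter


def summarize_detections(records):
    """Aggregate detection records into simple monitoring figures.

    Each record is assumed to be a dict with keys:
      'language'  - detected language code, e.g. 'es'
      'score'     - confidence score between 0 and 1
      'fell_back' - True if the default locale was used instead
    """
    records = list(records)
    total = len(records)
    if total == 0:
        return {"total": 0, "by_language": {}, "fallback_rate": 0.0, "mean_score": 0.0}

    by_language = Counter(record["language"] for record in records)
    fallbacks = sum(1 for record in records if record["fell_back"])
    mean_score = sum(record["score"] for record in records) / total

    return {
        "total": total,
        "by_language": dict(by_language),
        "fallback_rate": fallbacks / total,
        "mean_score": mean_score,
    }
```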
Log Messages Added:
- Detected language: {code} (confidence: {score})
- Using locale for autotagging: {locale}
- Using locale for extraction: {locale}
- Language set to {code} (from detected locale: {locale})
🚀 Deployment Notes
Prerequisites:
- AWS Comprehend service availability in deployment region
- Updated IAM permissions for ECS task role
- No additional environment variables required
Rollback Plan:
- Set PDF_LOCALE environment variable to force specific locale
- Previous hardcoded behavior can be restored by setting PDF_LOCALE=en-US
🔮 Future Enhancements
Potential Improvements:
- Language Detection Caching: Cache results for similar documents
- Multi-language Documents: Handle documents with multiple languages
- Custom Language Models: Support for domain-specific language detection
- User Override Interface: Allow manual language selection in frontend
- Language-specific Processing Rules: Customize processing based on detected language
- Analytics Dashboard: Visualize language distribution and processing metrics
📝 Files Modified
Core Changes:
- docker_autotag/autotag.py: Added language detection and locale parametrization
- app.py: Updated IAM permissions and removed hardcoded locale
Key Functions Added/Modified:
- detect_document_language() - New function for language detection
- autotag_pdf_with_options() - Added locale parameter
- extract_api() - Added locale parameter
- set_language_comprehend() - Enhanced with locale support
- main() - Integrated language detection workflow
🏷️ Labels
enhancement language-support aws-comprehend adobe-pdf-services accessibility internationalization i18n
🔗 Related Issues
- Addresses need for multi-language document processing
- Improves accessibility compliance for non-English documents
- Enhances Adobe PDF Services API utilization
Priority: High
Complexity: Medium
Impact: High - Significantly improves processing quality for non-English documents