OCR parity: remove pre-binarization, gated deskew, DPI metadata; regen markitdownnet only (eng, PSM=6, DPI=300) #27

mapo80 · 2025-08-21T21:10:00Z

Summary

Add detailed OCR options with defaults for DPI, OEM/PSM, threading, and color depth.
Rasterize images to 300 DPI grayscale, deskew only when the skew exceeds 2°, and pass DPI metadata to Tesseract.
OcrBench extract now refreshes markitdownnet by default and logs per-file OCR settings; compare gates on quality metrics.
Regenerate the dataset/validation/_ocr/markitdownnet outputs and benchmark artifacts.
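
As a rough illustration of the gated deskew and the DPI hand-off described above, here is a minimal Python sketch. It is not the project's C# code: `OcrOptions`, `should_deskew`, and `tesseract_args` are hypothetical names, the field layout is assumed, and the OEM default is left unset because this PR does not state it. The `-l`, `--psm`, `--oem`, and `--dpi` flags are those of the Tesseract command-line tool; the project may pass the same settings through a .NET wrapper instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OcrOptions:
    """Hypothetical option bag; names are illustrative, not the project's."""
    langs: str = "eng"
    psm: int = 6
    dpi: int = 300
    oem: Optional[int] = None          # default not stated in the PR, so unset
    threads: int = 1                   # value used in the extract command
    deskew_min_angle_deg: float = 2.0  # deskew only above this skew


def should_deskew(angle_deg: float, opts: OcrOptions) -> bool:
    """Gate deskew: rotate only when the estimated skew exceeds the threshold."""
    return abs(angle_deg) > opts.deskew_min_angle_deg


def tesseract_args(image: str, out_base: str, opts: OcrOptions) -> List[str]:
    """Build a Tesseract CLI invocation that carries the DPI explicitly."""
    args = ["tesseract", image, out_base, "-l", opts.langs,
            "--psm", str(opts.psm), "--dpi", str(opts.dpi)]
    if opts.oem is not None:
        args += ["--oem", str(opts.oem)]
    return args
```

With the defaults, a page skewed by 1.5° is left untouched while one skewed by 3° is rotated, which avoids resampling blur on pages that are already nearly straight.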

Testing

dotnet test
dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md (fails: Token-F1 0.6954 < 0.80 or line_F1 0.3794 < 0.50)
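
The compare failure above comes from a threshold gate on two scores. The sketch below shows one plausible way such a gate could work, in Python instead of the project's C#. The metric definitions are assumptions: Token-F1 is taken here as multiset overlap F1 over whitespace-separated tokens, and line_F1 as the same measure over non-empty stripped lines. OcrBench may define them differently. All function names are hypothetical; only the thresholds 0.80 and 0.50 come from the PR.

```python
from collections import Counter
from typing import Iterable, List


def multiset_f1(pred: Iterable[str], ref: Iterable[str]) -> float:
    """F1 over multiset overlap between predicted and reference items."""
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())  # per-item minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def token_f1(pred: str, ref: str) -> float:
    """Assumed definition: F1 over whitespace-separated tokens."""
    return multiset_f1(pred.split(), ref.split())


def line_f1(pred: str, ref: str) -> float:
    """Assumed definition: F1 over non-empty, stripped lines."""
    def lines(text: str) -> List[str]:
        return [ln.strip() for ln in text.splitlines() if ln.strip()]
    return multiset_f1(lines(pred), lines(ref))


def gate_failures(token: float, line: float,
                  min_token: float = 0.80, min_line: float = 0.50) -> List[str]:
    """Return one message per metric that falls below its threshold."""
    failures = []
    if token < min_token:
        failures.append(f"Token-F1 {token:.4f} < {min_token:.2f}")
    if line < min_line:
        failures.append(f"line_F1 {line:.4f} < {min_line:.2f}")
    return failures
```

Calling `gate_failures(0.6954, 0.3794)` with the scores reported in this PR yields two messages, so both metrics sit below their thresholds in this run.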

0f7f440 OCR parity: remove pre-binarization, gated deskew, DPI metadata; regen markitdownnet only (eng, PSM=6, DPI=300)

mapo80 added the codex label Aug 21, 2025 — with ChatGPT Codex Connector