Refine OCR pipeline and update bench gating #26

mapo80 · 2025-08-21T21:07:04Z

Summary

- Expand MarkItDownOptions with detailed OCR controls (DPI, PSM, OEM, threads, deskew, color depth)
- Preprocess images with optional scaling, grayscale, DPI metadata, and gated deskew
- Wire the Tesseract engine to the new options and thread limit
- Enhance OcrBench with selective refresh, per-file logging, and quality gates
- Refresh markitdownnet OCR artifacts (eng, PSM=6, DPI=300)

Testing

- dotnet test
- dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
- dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md (fails: GLOBAL Token-F1 0.7036, line_F1 0.3879)
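The compare step reports a Token-F1 and a line_F1 score and fails when they fall short. As a rough illustration of how such a gate could work, here is a minimal Python sketch. It is not OcrBench's actual implementation: the whitespace tokenization, exact-match line comparison, function names, and both thresholds are assumptions, since the PR does not state how the scores are computed or what the gate values are.

```python
from collections import Counter

# Hypothetical thresholds; the PR does not state the real gate values.
MIN_TOKEN_F1 = 0.90
MIN_LINE_F1 = 0.80


def _f1(ref_items, hyp_items):
    """F1 over two bags of items, counting duplicates (multiset overlap)."""
    if not ref_items and not hyp_items:
        return 1.0  # both empty: treat as a perfect match
    if not ref_items or not hyp_items:
        return 0.0
    overlap = sum((Counter(ref_items) & Counter(hyp_items)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_items)
    recall = overlap / len(ref_items)
    return 2 * precision * recall / (precision + recall)


def token_f1(reference, hypothesis):
    """Token-level F1 using whitespace tokenization (an assumption)."""
    return _f1(reference.split(), hypothesis.split())


def line_f1(reference, hypothesis):
    """Line-level F1 using exact match on stripped, non-empty lines (an assumption)."""
    ref_lines = [ln.strip() for ln in reference.splitlines() if ln.strip()]
    hyp_lines = [ln.strip() for ln in hypothesis.splitlines() if ln.strip()]
    return _f1(ref_lines, hyp_lines)


def passes_gate(token_score, line_score,
                min_token=MIN_TOKEN_F1, min_line=MIN_LINE_F1):
    """Return True only if both scores meet their minimum thresholds."""
    return token_score >= min_token and line_score >= min_line
```

Under these assumed thresholds, `passes_gate(0.7036, 0.3879)` returns `False`. Exact line matching is strict, because a single wrong character fails the whole line, so a line-level score well below the token-level score is plausible for the same OCR output.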

…n markitdownnet only (eng, PSM=6, DPI=300)

OCR parity: remove pre-binarization, gated deskew, DPI metadata; rege…

3219290


mapo80 added the codex label Aug 21, 2025 — with ChatGPT Codex Connector