@mapo80 commented Aug 21, 2025

Summary

  • add detailed OCR options with defaults for DPI, OEM/PSM, threading, and color depth
  • rasterize images to 300 DPI grayscale, deskew only when skew exceeds 2°, and pass DPI metadata to Tesseract
  • make OcrBench extract refresh markitdownnet by default and log per-file OCR settings; make compare gate on quality metrics
  • regenerate dataset/validation/_ocr/markitdownnet outputs and benchmark artifacts
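
The preprocessing rules in the summary can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: the names `OcrOptions`, `should_deskew`, and `target_pixels` are hypothetical, and it assumes the 2° threshold applies to the absolute skew angle. Only the 300 DPI, grayscale, and 2° values come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrOptions:
    """Hypothetical options record; the values mirror the summary above."""
    dpi: int = 300                     # rasterization resolution
    grayscale: bool = True             # rasterize to grayscale
    deskew_threshold_deg: float = 2.0  # deskew only above this skew


def should_deskew(skew_angle_deg: float, opts: OcrOptions = OcrOptions()) -> bool:
    """Return True only when the skew exceeds the threshold.

    Assumption: the threshold is compared against the absolute angle, so
    clockwise and counter-clockwise skew are treated the same way.
    """
    return abs(skew_angle_deg) > opts.deskew_threshold_deg


def target_pixels(width_in: float, height_in: float,
                  opts: OcrOptions = OcrOptions()) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size (inches) at opts.dpi."""
    return round(width_in * opts.dpi), round(height_in * opts.dpi)
```

Under these assumptions, a page skewed by 1.5° is left untouched while one skewed by -3° is deskewed, and a US Letter page rasterizes to 2550 x 3300 pixels.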

Testing

  • dotnet test
  • dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
  • dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md (fails the quality gate: Token-F1 0.6954 < 0.80 or line_F1 0.3794 < 0.50)
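
For readers unfamiliar with this kind of gate, the check that fails in the compare step can be sketched as below. This is a minimal illustration in Python, not OcrBench's implementation: the PR does not show how Token-F1 and line_F1 are computed, so the multiset-overlap definition, the whitespace tokenization, and the function names are all assumptions. Only the 0.80 and 0.50 thresholds and the reported scores come from the text above.

```python
from collections import Counter


def f1_overlap(reference: list[str], hypothesis: list[str]) -> float:
    """F1 over the multiset overlap of two item lists (assumed definition)."""
    if not reference and not hypothesis:
        return 1.0
    overlap = sum((Counter(reference) & Counter(hypothesis)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def token_f1(ref_text: str, hyp_text: str) -> float:
    """Token-level F1, assuming whitespace tokenization."""
    return f1_overlap(ref_text.split(), hyp_text.split())


def line_f1(ref_text: str, hyp_text: str) -> float:
    """Line-level F1 over stripped, non-empty lines (assumed definition)."""
    def lines(text: str) -> list[str]:
        return [ln.strip() for ln in text.splitlines() if ln.strip()]
    return f1_overlap(lines(ref_text), lines(hyp_text))


def gate(token_score: float, line_score: float,
         min_token_f1: float = 0.80, min_line_f1: float = 0.50) -> list[str]:
    """Return one message per threshold that is not met; empty means pass."""
    failures = []
    if token_score < min_token_f1:
        failures.append(f"Token-F1 {token_score:.4f} < {min_token_f1:.2f}")
    if line_score < min_line_f1:
        failures.append(f"line_F1 {line_score:.4f} < {min_line_f1:.2f}")
    return failures
```

Calling `gate(0.6954, 0.3794)` with the scores reported above returns a message for each metric, since both fall below their thresholds.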

https://chatgpt.com/codex/tasks/task_e_68a786d928ac8325bb19b5e3bce40bff
