mapo80 commented Aug 21, 2025

Summary

  • expose explicit OCR tuning in MarkItDownOptions (DPI, PSM, OEM, threads, force raster)
  • add Rasterizer for uniform 300 DPI preprocessing with Otsu binarisation and light deskew
  • drive Tesseract with the new options and wire benchmarks/smoke tests via OcrBench
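
For readers unfamiliar with the Otsu binarisation step mentioned above, here is a minimal, illustrative sketch in plain Python. It is not the Rasterizer code from this PR (which lives in the .NET project); the function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values. The idea is to choose the threshold that maximises the between-class variance of the dark and bright pixel groups, then map every pixel to pure black or white before handing the page to OCR.

```python
def otsu_threshold(pixels):
    """Return the 0..255 threshold that maximises between-class variance.

    Illustrative only: `pixels` is assumed to be a flat list of 8-bit
    grayscale values, which is an assumption of this sketch, not the PR's API.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_bg = 0          # number of pixels at or below the candidate threshold
    sum_bg = 0        # sum of intensities of those pixels
    best_t, best_var = 0, -1.0

    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is background; nothing left to separate
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (at or below threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Synthetic "page": half dark ink, half bright paper.
    page = [10] * 50 + [200] * 50
    t = otsu_threshold(page)
    print("threshold:", t)
    print("white pixels:", binarize(page, t).count(255))
```

A real rasterizer would apply this per page after rendering at the target DPI and would typically work on a 2D image buffer rather than a flat list; the thresholding logic itself stays the same.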

Testing

  • dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
  • dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md

https://chatgpt.com/codex/tasks/task_e_68a77ea46bdc8325bb77f0d7313776cc

mapo80 added 3 commits August 21, 2025 22:47
…data; regen markitdownnet only (eng, PSM=6, DPI=300)