@mapo80 mapo80 commented Aug 21, 2025

Summary

  • expand MarkItDownOptions with detailed OCR controls (DPI, PSM, OEM, threads, deskew, color depth)
  • preprocess images with optional scaling, grayscale, DPI metadata and gated deskew
  • wire Tesseract engine to new options and thread limit
  • enhance OcrBench with selective refresh, per-file logging and quality gates
  • refresh markitdownnet OCR artifacts (eng, PSM=6, DPI=300)

Testing

  • dotnet test
  • dotnet run --project tools/OcrBench -- extract --input-dir dataset/validation --out-dir dataset/validation/_ocr --threads 1 --langs eng --psm 6 --refresh markitdownnet
  • dotnet run --project tools/OcrBench -- compare --ocr-dir dataset/validation/_ocr --out-json artifacts/validation/OCR/bench-ocr.json --out-md artifacts/validation/OCR/summary-ocr.md (fails: GLOBAL Token-F1 0.7036, line_F1 0.3879)
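
For reviewers unfamiliar with the two scores reported by the compare step, here is a rough Python sketch of how a token-level and a line-level F1 could be computed and checked against a gate. It is not the OcrBench implementation: the function names, the whitespace tokenization, the line normalization and the thresholds (`min_token_f1`, `min_line_f1`) are all assumptions made for illustration, and the real tool may define them differently.

```python
from collections import Counter


def multiset_f1(predicted, reference):
    """F1 between two sequences compared as multisets (order is ignored)."""
    if not predicted and not reference:
        return 1.0  # two empty outputs are treated as a perfect match
    # Counter intersection keeps the minimum count of each shared item.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def token_f1(ocr_text, truth_text):
    """Token-level F1, assuming tokens are whitespace-separated words."""
    return multiset_f1(ocr_text.split(), truth_text.split())


def line_f1(ocr_text, truth_text):
    """Line-level F1: a line only counts if it matches exactly after
    collapsing runs of whitespace (blank lines are skipped)."""
    def normalize(text):
        return [" ".join(line.split()) for line in text.splitlines() if line.strip()]
    return multiset_f1(normalize(ocr_text), normalize(truth_text))


def passes_gates(token_score, line_score, min_token_f1=0.80, min_line_f1=0.50):
    """Hypothetical quality gate; these thresholds are NOT taken from OcrBench."""
    return token_score >= min_token_f1 and line_score >= min_line_f1
```

Under this definition a single wrong token makes the whole line a mismatch, which is one plausible reason a line-level score can sit well below the token-level score for the same output.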

https://chatgpt.com/codex/tasks/task_e_68a786d928ac8325bb19b5e3bce40bff
