⚡️ Speed up method OCRAgentTesseract.extract_word_from_hocr by 35%
#2
+12
−5
📄 35% (0.35x) speedup for OCRAgentTesseract.extract_word_from_hocr in unstructured/partition/utils/ocr_models/tesseract_ocr.py
⏱️ Runtime: 7.18 milliseconds → 5.31 milliseconds (best of 13 runs)
📝 Explanation and details
The optimized code achieves a 35% speedup through two key performance improvements:
1. Regex Precompilation
The original code calls re.search(r"x_conf (\d+\.\d+)", char_title) inside the loop, so every iteration passes the raw pattern string through the re module's compile-and-cache lookup before searching. The optimization moves this to module level as _RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)"), compiling it once at import time and calling the compiled pattern directly. The line profiler shows the regex search time improved from 12.73ms (42.9% of total time) to 3.02ms (16.2% of total time), a 76% reduction in regex overhead.
2. Efficient String Building
The original code uses string concatenation (word_text += char), which can create a new string object on each addition because Python strings are immutable. With 6,339 character additions in the profiled run, this can become expensive. The optimization collects characters in a list (chars.append(char)) and builds the final string once with "".join(chars). In the profiled run, accumulation took 1.52ms with concatenation versus 1.58ms for the appends plus a single 46μs join, so this change is roughly neutral at this input size and most of the measured gain comes from the regex change.
Performance Impact
These optimizations are particularly effective for OCR processing.
The 35% speedup directly translates to faster document processing in OCR workflows, with the most significant gains occurring when processing documents with many detected characters that pass the confidence threshold.
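The two changes described above can be sketched as follows. This is a minimal illustration, not the code from tesseract_ocr.py: the helper name join_confident_chars, its (char, title) input pairs, the sample title strings, and the 0-100 threshold scale are all assumptions made for the example. Only the pattern, the _RE_X_CONF name, the chars list, and the final "".join(chars) come from the description.

```python
import re

# Compiled once at import time instead of passing the raw pattern
# to re.search() on every loop iteration.
_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")


def join_confident_chars(char_spans, threshold=0.0):
    """Hypothetical helper: char_spans is an iterable of (char, title) pairs.

    Keeps characters whose x_conf value is at least `threshold`
    (assumed here to be on the same 0-100 scale as x_conf).
    """
    chars = []  # collect pieces in a list rather than using `word_text += char`
    for char, char_title in char_spans:
        match = _RE_X_CONF.search(char_title)
        if match is None:
            continue  # assumed behavior: skip titles without a confidence value
        if float(match.group(1)) >= threshold:
            chars.append(char)
    return "".join(chars)  # build the final string once


# Made-up sample data in the style of hOCR character titles.
spans = [
    ("H", "x_bboxes 0 0 5 9; x_conf 99.12"),
    ("i", "x_bboxes 6 0 8 9; x_conf 42.50"),
    ("!", "x_bboxes 9 0 11 9; x_conf 97.80"),
]
print(join_confident_chars(spans, threshold=90.0))  # prints "H!"
```

Because the compiled pattern lives at module level, every call to the helper reuses the same pattern object, and the join cost is paid once per word rather than once per character.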
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
partition/pdf_image/test_ocr.py::test_extract_word_from_hocr
🌀 Generated Regression Tests and Runtime
To edit these changes, git checkout codeflash/optimize-OCRAgentTesseract.extract_word_from_hocr-mjcarjk8 and push.