⚡️ Speed up function is_text_element by 190%
#6
📄 190% (1.90x) speedup for `is_text_element` in `unstructured/partition/html/transformations.py`
⏱️ Runtime: 6.63 milliseconds → 2.28 milliseconds (best of 76 runs)
📝 Explanation and details
The optimization achieves a roughly 190% speedup by eliminating work that was repeated on every call and by relying on Python's C-implemented `isinstance()` check instead of Python-level loops.
Key optimizations:
- Module-level constant definition: Moved `text_classes` and `text_categories` to module scope as tuples instead of recreating them on every function call. This eliminates ~70% of the original runtime, which was spent reconstructing these collections (8.7ms → 0ms in line profiler).
- Tuple vs list: Changed from lists to tuples, which are more memory-efficient and can be passed directly to `isinstance()`, since it accepts a tuple of types as its second argument.
- Eliminated generator expressions: Replaced `any(isinstance(...) for ...)` with a direct `isinstance(ontology_element, text_classes)` call, which is significantly faster because it avoids generator overhead and performs the multiple-type check in Python's C implementation.
- Direct indexing: Replaced `any(... for category in text_categories)` with the direct comparison `ontology_element.elementType == text_categories[0]`, since there is only one category, eliminating loop overhead.

Performance impact: The line profiler shows the critical path went from 52.3% + 15.9% = 68.2% of runtime in generator expressions to 67.5% + 32.5% = 100% in just two optimized operations, but with 7.8x less total time.

Hot path relevance: Based on `function_references`, this function is called from `can_unstructured_elements_be_merged()` within a loop processing multiple HTML elements. The optimization should speed up HTML parsing workflows where element classification happens frequently.

Test case performance: All test cases show consistent 140-213% speedups. Large-scale inputs (800+ elements) show ~190% improvements, which makes the change well suited to batch document processing.
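The changes listed under Key optimizations can be sketched roughly as follows. This is a minimal illustration, not the code from `transformations.py`: the classes `OntologyElement`, `Paragraph`, `Heading`, `Image`, and `Note`, the `"metadata"` category value, and the uppercase constant names are all hypothetical stand-ins chosen so the example runs on its own.

```python
# Hypothetical stand-ins for the real ontology classes (assumptions, not
# the actual unstructured types).
class OntologyElement:
    elementType = None


class Paragraph(OntologyElement):
    pass


class Heading(OntologyElement):
    pass


class Image(OntologyElement):
    pass


class Note(OntologyElement):
    elementType = "metadata"  # hypothetical category value


def is_text_element_before(ontology_element):
    # Both collections are rebuilt on every call.
    text_classes = [Paragraph, Heading]
    text_categories = ["metadata"]
    # Each check runs a Python-level generator inside any().
    if any(isinstance(ontology_element, cls) for cls in text_classes):
        return True
    if any(ontology_element.elementType == cat for cat in text_categories):
        return True
    return False


# Built once at import time; tuples can be passed straight to isinstance().
TEXT_CLASSES = (Paragraph, Heading)
TEXT_CATEGORIES = ("metadata",)


def is_text_element_after(ontology_element):
    # One isinstance() call covers every class in the tuple.
    if isinstance(ontology_element, TEXT_CLASSES):
        return True
    # Only one category exists, so compare against it directly.
    return ontology_element.elementType == TEXT_CATEGORIES[0]
```

Note that the direct-indexing form assumes the category tuple keeps exactly one entry; if a second category were added, the comparison would need to become a membership test such as `ontology_element.elementType in TEXT_CATEGORIES`.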
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-is_text_element-mjccgnlf` and push.