⚡️ Speed up function remove_empty_divs_from_html_content by 6%
#8
📄 6% (0.06x) speedup for remove_empty_divs_from_html_content in unstructured/partition/html/transformations.py
⏱️ Runtime: 148 milliseconds → 139 milliseconds (best of 53 runs)
📝 Explanation and details
The optimized code achieves a 6% speedup by pre-filtering divs before iteration, reducing the number of attribute checks performed during the expensive div.unwrap() operations.

Key optimization: The original code checked div.attrs for every div in the reversed iteration (6,870 checks in the profiler), but the optimized version filters divs upfront using a list comprehension, [div for div in divs if not div.attrs], then only iterates over divs that actually need unwrapping (5,922 iterations vs 6,870).

Why this is faster:
Instead of checking div.attrs during each iteration through all divs, we check it once during filtering

Performance characteristics from tests:

Impact on workloads: Since this function is called from parse_html_to_ontology() in HTML parsing workflows, the 6% improvement will benefit any HTML processing pipeline that handles documents with multiple divs, especially those with a mix of empty and non-empty divs. The optimization is most valuable for large HTML documents where the filtering step saves significant redundant work.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, git checkout codeflash/optimize-remove_empty_divs_from_html_content-mjccuuy3 and push.
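
For reference, the pre-filtering change described in the explanation can be sketched roughly as follows. This is an illustration under assumptions, not the PR diff: Node is a hypothetical stand-in for a parsed HTML tag so the sketch runs with the standard library alone, and the two function names, find_all_divs, build_tree and outline are invented for the example. It assumes the original function looped over the reversed div list and tested div.attrs inside the loop.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a parsed HTML tag (not the real library API)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = None

    def append(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def unwrap(self) -> None:
        """Replace this node with its own children in the parent's child list."""
        parent = self.parent
        idx = next(i for i, c in enumerate(parent.children) if c is self)
        for child in self.children:
            child.parent = parent
        parent.children[idx:idx + 1] = self.children
        self.parent, self.children = None, []


def find_all_divs(root: Node) -> list:
    """Return every descendant div in document order."""
    found = []
    for child in root.children:
        if child.name == "div":
            found.append(child)
        found.extend(find_all_divs(child))
    return found


def remove_empty_divs_original(root: Node) -> Node:
    # Assumed shape of the original: test attrs inside the reversed loop.
    for div in reversed(find_all_divs(root)):
        if not div.attrs:
            div.unwrap()
    return root


def remove_empty_divs_prefiltered(root: Node) -> Node:
    # Optimized shape: filter once, then loop only over divs to unwrap.
    to_unwrap = [div for div in find_all_divs(root) if not div.attrs]
    for div in reversed(to_unwrap):
        div.unwrap()
    return root


def build_tree() -> Node:
    """body > div > [div.keep > p, div > span]"""
    root = Node("body")
    outer = root.append(Node("div"))
    outer.append(Node("div", {"class": "keep"})).append(Node("p"))
    outer.append(Node("div")).append(Node("span"))
    return root


def outline(node: Node) -> tuple:
    """Nested-tuple summary of a tree, used to compare results."""
    return (node.name, tuple(sorted(node.attrs.items())),
            tuple(outline(c) for c in node.children))
```

Both variants unwrap the same divs in the same order, so the resulting trees should be identical; the pre-filtered version only moves the attrs test out of the loop that performs the unwrapping.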