@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 6% (0.06x) speedup for remove_empty_divs_from_html_content in unstructured/partition/html/transformations.py

⏱️ Runtime : 148 milliseconds → 139 milliseconds (best of 53 runs)
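
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is computed as original / optimized - 1; the report does not state this formula, but it is consistent with the per-test figures (for example, 60.7μs -> 47.2μs is reported as 28.6% faster). The helper name `speedup` is hypothetical.

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup, assuming the convention original / optimized - 1."""
    return original / optimized - 1.0


# Headline runtimes from the report, in milliseconds.
headline = speedup(148, 139)

# One of the per-test timings, in microseconds.
per_test = speedup(60.7, 47.2)

print(f"headline: {headline:.1%}, per-test: {per_test:.1%}")
```

Under this assumption the headline works out to roughly 6.5%, which the report shows as 6% (0.06x).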

📝 Explanation and details

The optimized code achieves a 6% speedup by pre-filtering divs before iteration, so the div.attrs check runs once in a list comprehension instead of inside the loop that performs the expensive div.unwrap() calls.

Key optimization: The original code checked div.attrs for every div in the reversed iteration (6,870 checks in the profiler), but the optimized version filters divs upfront using a list comprehension [div for div in divs if not div.attrs], then only iterates over divs that actually need unwrapping (5,922 iterations vs 6,870).

Why this is faster:

  • Reduced attribute lookups: Instead of checking div.attrs during each iteration through all divs, we check it once during filtering
  • Fewer loop iterations: The main loop only processes divs that will actually be unwrapped (5,861 unwrap calls in both versions, but fewer total iterations)
  • Better cache locality: Processing a smaller, filtered list improves memory access patterns
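
The two loop shapes described above can be sketched as follows. This is a stdlib-only illustration under stated assumptions, not the code from the PR: `FakeDiv`, `unwrap_in_loop`, and `unwrap_prefiltered` are hypothetical names, and appending to a list stands in for the real `div.unwrap()` call on a BeautifulSoup tag.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDiv:
    """Hypothetical stand-in for a parsed div: only `attrs` matters here."""
    name: str
    attrs: dict = field(default_factory=dict)


def unwrap_in_loop(divs):
    """Original shape as described: test `attrs` inside the reversed loop."""
    unwrapped = []
    for div in reversed(divs):
        if not div.attrs:
            unwrapped.append(div.name)  # stands in for div.unwrap()
    return unwrapped


def unwrap_prefiltered(divs):
    """Optimized shape as described: filter first, then loop over candidates only."""
    candidates = [div for div in divs if not div.attrs]
    unwrapped = []
    for div in reversed(candidates):
        unwrapped.append(div.name)  # stands in for div.unwrap()
    return unwrapped


divs = [FakeDiv("a"), FakeDiv("b", {"id": "x"}), FakeDiv("c")]

# Both shapes select the same divs in the same (reversed) order.
assert unwrap_in_loop(divs) == unwrap_prefiltered(divs)
```

Both variants still inspect `attrs` once per div; the difference is only where that check happens, which is why the expected gain is modest.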

Performance characteristics from tests:

  • Small HTML (single divs): 7-12% faster due to reduced overhead
  • Large-scale tests (1000+ divs): 4-10% faster, with the best gains on mixed content where many divs don't need unwrapping
  • Nested structures: 8-13% faster as the filtering eliminates unnecessary traversals
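
To check figures like these locally, one option is a best-of-N timing harness. The sketch below uses the standard-library `timeit` module; it is a generic approach and not necessarily how Codeflash measures runtimes. `best_of` is a hypothetical helper, and the string `replace` call is only a toy workload standing in for the function under test.

```python
import timeit
from typing import Callable


def best_of(func: Callable[[], object], repeats: int = 5, number: int = 100) -> float:
    """Return the fastest per-call time in seconds across `repeats` timing rounds."""
    rounds = timeit.repeat(func, repeat=repeats, number=number)
    return min(rounds) / number


# Toy workload; in practice, pass a call to the original or the optimized function.
sample = "<div></div>" * 1000
elapsed = best_of(lambda: sample.replace("<div></div>", ""), repeats=3, number=50)

print(f"best per-call time: {elapsed * 1e6:.1f} microseconds")
```

Taking the minimum rather than the mean reduces the influence of unrelated system noise, which matters when the expected difference is only a few percent.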

Impact on workloads: Since this function is called from parse_html_to_ontology() in HTML parsing workflows, the 6% improvement will benefit any HTML processing pipeline that handles documents with multiple divs, especially those with a mix of empty and non-empty divs. The optimization is most valuable for large HTML documents where the filtering step saves significant redundant work.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 61 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from bs4 import BeautifulSoup

from unstructured.partition.html.transformations import remove_empty_divs_from_html_content

# unit tests

# ------------------------------
# Basic Test Cases
# ------------------------------


def test_remove_single_empty_div():
    # Removes a single empty div
    html = "<div></div>"
    expected = ""
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 43.6μs -> 40.4μs (7.94% faster)


def test_remove_multiple_empty_divs():
    # Removes multiple empty divs
    html = "<div></div><div></div>"
    expected = ""
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 56.4μs -> 51.6μs (9.20% faster)


def test_does_not_remove_div_with_content():
    # Does not remove divs with text content
    html = "<div>hello</div>"
    expected = "<div>hello</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 59.3μs -> 53.0μs (11.9% faster)


def test_does_not_remove_div_with_child():
    # Does not remove divs with non-empty child elements
    html = "<div><span>text</span></div>"
    expected = "<div><span>text</span></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 72.5μs -> 65.6μs (10.4% faster)


def test_remove_empty_divs_nested():
    # Removes nested empty divs but keeps non-empty structure
    html = "<div><div></div><div>keep</div></div>"
    expected = "<div>keep</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 81.6μs -> 74.9μs (8.96% faster)


def test_does_not_remove_div_with_attributes():
    # Does not remove divs with any attributes, even if empty
    html = '<div id="foo"></div>'
    expected = '<div id="foo"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 53.2μs -> 50.2μs (6.15% faster)


def test_removes_only_divs_without_attributes():
    # Only divs with no attributes are removed
    html = '<div></div><div class="a"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 68.2μs -> 64.6μs (5.61% faster)
    soup = BeautifulSoup(result, "html.parser")
    divs = soup.find_all("div")


def test_preserves_non_div_tags():
    # Other tags are not affected
    html = "<span></span><div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 58.7μs -> 55.7μs (5.38% faster)
    soup = BeautifulSoup(result, "html.parser")


# ------------------------------
# Edge Test Cases
# ------------------------------


def test_empty_string():
    # Empty string input returns empty string
    html = ""
    expected = ""
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 28.9μs -> 28.0μs (3.28% faster)


def test_no_divs():
    # HTML with no divs is unchanged
    html = "<span>text</span>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 52.5μs -> 50.4μs (4.13% faster)


def test_div_with_whitespace_content():
    # Divs with only whitespace are not considered empty
    html = "<div>   </div>"
    # According to the implementation, whitespace is content, so div remains
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 54.3μs -> 51.8μs (4.91% faster)


def test_div_with_comment():
    # Div containing only a comment (BeautifulSoup parses the comment as a child node)
    html = "<div><!-- comment --></div>"
    # BeautifulSoup treats comment as content, so div remains
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 57.8μs -> 54.8μs (5.47% faster)


def test_nested_empty_divs():
    # Nested empty divs, all should be unwrapped
    html = "<div><div><div></div></div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 62.2μs -> 59.6μs (4.47% faster)


def test_div_with_empty_attribute():
    # Div with empty attribute should not be removed
    html = '<div id=""></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 50.7μs -> 49.5μs (2.44% faster)


def test_mixed_content_and_empty_divs():
    # Only empty divs are removed, content remains
    html = "<div></div>hello<div>world</div><div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 83.2μs -> 77.8μs (6.91% faster)


def test_div_with_nested_empty_div():
    # The innermost empty div is unwrapped first
    html = "<div><div></div></div>"
    # Once the inner div is unwrapped, the outer div is empty and should be unwrapped too
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 51.9μs -> 49.5μs (4.88% faster)


def test_div_with_non_div_empty_tag():
    # Only divs are targeted, not other tags
    html = "<div></div><span></span>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 56.9μs -> 53.9μs (5.49% faster)


def test_div_with_text_and_empty_div():
    # Only the empty div is removed, text remains
    html = "<div>text</div><div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 66.3μs -> 63.4μs (4.54% faster)


# ------------------------------
# Large Scale Test Cases
# ------------------------------


def test_large_number_of_empty_divs():
    # Test performance and correctness with 1000 empty divs
    html = "<div></div>" * 1000
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 19.9ms -> 19.1ms (4.43% faster)


def test_large_mixed_divs():
    # 500 empty divs, 500 non-empty divs
    html = "<div></div>" * 500 + "".join(f"<div>{i}</div>" for i in range(500))
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 22.6ms -> 22.2ms (1.81% faster)
    # Only non-empty divs remain
    for i in range(500):
        pass


def test_large_nested_empty_divs():
    # Deeply nested empty divs, all should be unwrapped
    html = "".join("<div>" for _ in range(100)) + "".join("</div>" for _ in range(100))
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 939μs -> 858μs (9.46% faster)


def test_large_html_with_attributes():
    # Large structure with attributes, none should be removed
    html = "".join(f'<div id="d{i}"></div>' for i in range(1000))
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 13.5ms -> 13.0ms (3.43% faster)
    soup = BeautifulSoup(result, "html.parser")
    divs = soup.find_all("div")
    for i in range(1000):
        pass


def test_large_html_with_mixed_tags():
    # Large HTML with many divs and other tags
    html = "<div></div>" * 500 + "<span></span>" * 500
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 12.6ms -> 11.8ms (7.29% faster)
    soup = BeautifulSoup(result, "html.parser")


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# function to test
from unstructured.partition.html.transformations import remove_empty_divs_from_html_content

# unit tests

# --------------------------
# BASIC TEST CASES
# --------------------------


def test_remove_single_empty_div():
    # Single empty div should be removed
    html = "<div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 60.7μs -> 47.2μs (28.6% faster)


def test_remove_single_empty_div_with_whitespace():
    # Single empty div with whitespace should be removed
    html = "<div>   \n\t </div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 71.0μs -> 59.8μs (18.7% faster)


def test_div_with_text_not_removed():
    # Div with text should not be removed
    html = "<div>Hello</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 60.0μs -> 55.7μs (7.71% faster)


def test_div_with_child_not_removed():
    # Div with a child element should not be removed
    html = "<div><span>child</span></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 76.0μs -> 69.1μs (9.95% faster)


def test_div_with_attributes_not_removed():
    # Div with any attribute should not be removed
    html = '<div id="main"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 57.5μs -> 51.8μs (11.0% faster)


def test_multiple_empty_and_nonempty_divs():
    # Only empty divs with no attributes should be removed
    html = "<div></div><div>content</div><div id='x'></div><div> </div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 109μs -> 104μs (4.75% faster)


def test_nested_empty_divs():
    # Nested empty divs should all be removed
    html = "<div><div><div></div></div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 64.4μs -> 60.3μs (6.84% faster)


def test_nested_divs_with_content():
    # Only truly empty nested divs are removed, content is preserved
    html = "<div><div><div>keep</div></div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 83.0μs -> 75.8μs (9.57% faster)


def test_div_with_comment_not_removed():
    # Div with a comment inside is not considered empty
    html = "<div><!-- comment --></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 61.6μs -> 56.1μs (9.80% faster)


# --------------------------
# EDGE TEST CASES
# --------------------------


def test_empty_string():
    # Empty string input returns empty string
    html = ""
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 30.2μs -> 27.8μs (8.38% faster)


def test_html_without_divs():
    # HTML without any divs should be unchanged
    html = "<span>test</span>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 55.2μs -> 51.6μs (6.86% faster)


def test_div_with_only_nbsp():
    # Div with only non-breaking space is not empty
    html = "<div>&nbsp;</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 58.6μs -> 53.9μs (8.82% faster)


def test_div_with_only_html_entity():
    # Div with only an HTML entity is not empty
    html = "<div>&amp;</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 57.5μs -> 53.8μs (6.90% faster)


def test_div_with_only_comment():
    # Div with only a comment is not empty
    html = "<div><!-- comment --></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 58.4μs -> 54.3μs (7.44% faster)


def test_div_with_script_tag():
    # Div containing a script tag is not empty
    html = "<div><script>var x=1;</script></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 77.3μs -> 70.8μs (9.18% faster)


def test_div_with_nested_empty_divs_and_content():
    # Only the truly empty divs are removed, content is preserved
    html = "<div><div></div>content<div> </div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 94.3μs -> 84.6μs (11.5% faster)


def test_div_with_attributes_and_empty_content():
    # Div with attributes but empty content is not removed
    html = '<div class="foo"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 57.5μs -> 52.4μs (9.71% faster)


def test_div_with_data_attribute_and_empty_content():
    # Div with data attribute but empty content is not removed
    html = '<div data-test="1"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 52.9μs -> 50.0μs (5.84% faster)


def test_div_with_style_attribute_and_empty_content():
    # Div with style attribute but empty content is not removed
    html = '<div style="display:none"></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 52.8μs -> 48.9μs (8.01% faster)


def test_div_with_nested_empty_divs_and_attributes():
    # Only attribute-less empty divs are removed, attribute divs remain
    html = '<div><div id="1"></div><div></div></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 80.3μs -> 75.8μs (6.00% faster)


def test_div_with_nested_empty_divs_and_comments():
    # Div with only comments is not empty
    html = "<div><!-- comment --></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 58.6μs -> 54.2μs (8.06% faster)


def test_html_with_doctype_and_html_tags():
    # Should not remove doctype or html/body tags
    html = "<!DOCTYPE html><html><body><div></div></body></html>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 80.6μs -> 75.8μs (6.38% faster)


def test_div_with_multiple_kinds_of_whitespace():
    # Div with tabs/newlines/spaces only is empty
    html = "<div>\n\t \r</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 56.8μs -> 51.9μs (9.31% faster)


def test_div_with_nested_empty_divs_and_text():
    # Only deepest empty divs are removed; text is preserved
    html = "<div><div><div></div>hello</div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 81.3μs -> 74.7μs (8.93% faster)


# --------------------------
# LARGE SCALE TEST CASES
# --------------------------


def test_large_number_of_empty_divs():
    # Remove 1000 empty divs
    html = "<div></div>" * 1000
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 20.4ms -> 19.0ms (6.87% faster)


def test_large_number_of_nonempty_divs():
    # 1000 non-empty divs should remain
    html = "".join(f"<div>{i}</div>" for i in range(1000))
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 27.9ms -> 25.4ms (10.0% faster)
    for i in range(1000):
        pass


def test_large_mixed_divs():
    # Mix of 500 empty and 500 non-empty divs
    html = ("<div></div>" * 500) + "".join(f"<div>{i}</div>" for i in range(500))
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 24.1ms -> 22.3ms (8.05% faster)
    for i in range(500):
        pass


def test_large_nested_div_structure():
    # Deeply nested divs, only outermost has content
    html = "content"
    for _ in range(50):
        html = f"<div>{html}</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 720μs -> 635μs (13.3% faster)


def test_large_nested_empty_div_structure():
    # Deeply nested empty divs, all should be removed
    html = ""
    for _ in range(50):
        html = f"<div>{html}</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 486μs -> 450μs (7.82% faster)


def test_large_html_with_mixed_content():
    # 100 divs, every other is empty, rest have content
    html = ""
    for i in range(100):
        if i % 2 == 0:
            html += "<div></div>"
        else:
            html += f"<div>{i}</div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 1.39ms -> 1.31ms (6.52% faster)
    for i in range(1, 100, 2):
        pass


# --------------------------
# ADDITIONAL EDGE CASES
# --------------------------


def test_div_with_nested_span_and_empty():
    # Div with a span and an empty div, only the empty div is removed
    html = "<div><span>text</span><div></div></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 96.4μs -> 79.4μs (21.4% faster)


def test_uppercase_div_tag():
    # HTML is case-insensitive, but BeautifulSoup normalizes to lowercase
    html = "<DIV></DIV>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 47.6μs -> 41.2μs (15.6% faster)


def test_div_with_br_tag():
    # Div with only a <br> is not empty
    html = "<div><br/></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 64.4μs -> 58.0μs (11.1% faster)


def test_div_with_img_tag():
    # Div with only an <img> is not empty
    html = '<div><img src="x.png"/></div>'
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 75.3μs -> 64.0μs (17.6% faster)


def test_div_with_empty_span():
    # Div with only an empty span is not empty
    html = "<div><span></span></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 64.2μs -> 58.5μs (9.75% faster)


def test_div_with_empty_p():
    # Div with only an empty <p> is not empty
    html = "<div><p></p></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 64.4μs -> 58.6μs (9.88% faster)


def test_div_with_multiple_empty_children():
    # Div with only empty children should not be considered empty
    html = "<div><span></span><p></p></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 79.2μs -> 74.9μs (5.78% faster)


def test_div_with_empty_and_nonempty_children():
    # Div with empty and non-empty children is not empty
    html = "<div><span></span><p>text</p></div>"
    codeflash_output = remove_empty_divs_from_html_content(html)
    result = codeflash_output  # 91.7μs -> 80.5μs (13.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
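
The equivalence check mentioned in the comment above can be sketched as a comparison of two implementations over shared inputs. This is an illustration under stated assumptions, not Codeflash's actual harness: `outputs_match`, `strip_v1`, and `strip_v2` are hypothetical names, and the two toy functions merely stand in for the original and optimized versions of the function under test.

```python
from typing import Callable, Iterable, List


def outputs_match(
    original: Callable[[str], str],
    optimized: Callable[[str], str],
    inputs: Iterable[str],
) -> List[str]:
    """Return the inputs on which the two implementations disagree."""
    return [html for html in inputs if original(html) != optimized(html)]


def strip_v1(html: str) -> str:
    # Toy stand-in for the original implementation.
    return html.replace("<div></div>", "")


def strip_v2(html: str) -> str:
    # Toy stand-in for the optimized implementation.
    return "".join(html.split("<div></div>"))


samples = ["", "<div></div>", "<div>hello</div>", '<div id="foo"></div>']
mismatches = outputs_match(strip_v1, strip_v2, samples)

print(f"{len(mismatches)} mismatching inputs")
```

An empty result means the two versions agree on every sampled input; it does not prove equivalence on inputs outside the sample.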

To edit these changes git checkout codeflash/optimize-remove_empty_divs_from_html_content-mjccuuy3 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 04:14
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 19, 2025