@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 190% (1.90x) speedup for is_text_element in unstructured/partition/html/transformations.py

⏱️ Runtime: 6.63 milliseconds → 2.28 milliseconds (best of 76 runs)

📝 Explanation and details

The optimization achieves a 190% speedup (6.63 ms → 2.28 ms) by moving per-call setup out of the function and replacing generator-based checks with direct built-in calls.

Key optimizations:

  1. Module-level constant definition: Moved text_classes and text_categories to module scope as tuples, so they are built once at import time instead of on every function call. According to the line profiler, rebuilding these collections accounted for roughly 70% of the original runtime (8.7ms → 0ms).

  2. Tuples instead of lists: The collections are now tuples rather than lists. isinstance() accepts a single class or a tuple of classes as its second argument (not a list), so the tuple can be passed to it directly. Tuples are also immutable and slightly more compact than lists.

  3. Eliminated generator expressions: Replaced any(isinstance(...) for ...) with a single isinstance(ontology_element, text_classes) call. This avoids creating and iterating a generator and lets the built-in C implementation check all classes in one call.

  4. Direct indexing: Replaced any(... for category in text_categories) with the direct comparison ontology_element.elementType == text_categories[0]. This removes the loop overhead, but it is only equivalent while text_categories contains exactly one category.
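The four changes above can be sketched as a before/after pair. This is a minimal, self-contained illustration, not the actual code in unstructured/partition/html/transformations.py: the stand-in classes mirror the mocks used in the generated tests, only two text classes are shown, and the names `is_text_element_before`, `is_text_element_after`, `_TEXT_CLASSES` and `_TEXT_CATEGORIES` are assumptions made for the example.

```python
# Hypothetical stand-ins for the real ontology types (assumption, not library code).
class ElementTypeEnum:
    metadata = "metadata"
    other = "other"


class OntologyElement:
    def __init__(self, elementType=None):
        self.elementType = elementType


class NarrativeText(OntologyElement):
    pass


class Paragraph(OntologyElement):
    pass


class Table(OntologyElement):
    pass


def is_text_element_before(ontology_element):
    # Both lists are rebuilt on every call, then scanned with generator expressions.
    text_classes = [NarrativeText, Paragraph]
    text_categories = [ElementTypeEnum.metadata]
    if any(isinstance(ontology_element, cls) for cls in text_classes):
        return True
    return any(ontology_element.elementType == cat for cat in text_categories)


# Built once at import time; isinstance() accepts a tuple of classes directly.
_TEXT_CLASSES = (NarrativeText, Paragraph)
_TEXT_CATEGORIES = (ElementTypeEnum.metadata,)


def is_text_element_after(ontology_element):
    if isinstance(ontology_element, _TEXT_CLASSES):
        return True
    # Equivalent to the loop only while there is exactly one text category.
    return ontology_element.elementType == _TEXT_CATEGORIES[0]
```

If a second category were ever added, `ontology_element.elementType in _TEXT_CATEGORIES` would be a safer form of the last line than indexing element 0, at the cost of a small membership test.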

Performance impact: In the line profiler, the two generator expressions accounted for 52.3% + 15.9% = 68.2% of the original runtime. In the optimized version the two remaining operations account for all of the runtime (67.5% + 32.5% = 100%), but the total profiled time is 7.8x lower.
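A comparison of this kind can be reproduced locally with a small timeit micro-benchmark. The sketch below is an assumption-laden illustration rather than the profiler setup used for this PR: the classes `A`, `B`, `C` and the helper names are hypothetical, and absolute timings will differ by machine and Python version.

```python
import timeit


# Hypothetical placeholder classes standing in for the ontology text classes.
class A:
    pass


class B:
    pass


class C:
    pass


TEXT_CLASSES = (A, B, C)  # built once at module scope


def check_with_generator(obj):
    classes = [A, B, C]  # rebuilt on every call, as in the original version
    return any(isinstance(obj, cls) for cls in classes)


def check_with_tuple(obj):
    return isinstance(obj, TEXT_CLASSES)  # one built-in call, no generator


def measure(func, obj, number=100_000, repeat=5):
    # Best-of-N total time in seconds for `number` calls, to reduce noise.
    return min(timeit.repeat(lambda: func(obj), number=number, repeat=repeat))


if __name__ == "__main__":
    obj = C()  # last class in the list, so the generator scans every entry
    slow = measure(check_with_generator, obj)
    fast = measure(check_with_tuple, obj)
    print(f"generator: {slow:.4f}s  tuple: {fast:.4f}s")
```

Taking the minimum over several repeats follows the same "best of N runs" idea as the runtime figure reported at the top of this PR.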

Hot path relevance: Based on function_references, this function is called from can_unstructured_elements_be_merged() within a loop processing multiple HTML elements. The optimization should therefore benefit HTML parsing workflows where element classification happens frequently; the end-to-end gain depends on how much of the total time is spent in this function.

Test case performance: All test cases show speedups between 135% and 213%. The large-batch tests (500 to 1000 elements) are the most consistent, at roughly 190%, which suggests the gain carries over to batch document processing.
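The percentages quoted here appear to follow the convention (before - after) / after, which is an inference from the reported numbers rather than something stated in the report. A minimal sketch of that arithmetic, with a hypothetical helper name `speedup_percent`:

```python
def speedup_percent(before, after):
    """Relative speedup in percent, assuming the (before - after) / after convention."""
    if after <= 0:
        raise ValueError("after must be positive")
    return (before - after) / after * 100.0


# Overall runtime: 6.63 ms -> 2.28 ms
overall = speedup_percent(6.63, 2.28)  # about 190.8

# First generated test: 2.58 us -> 1.04 us
single = speedup_percent(2.58, 1.04)  # about 148.1

print(f"overall: {overall:.1f}%  single test: {single:.1f}%")
```

Under this convention a 190% speedup means the optimized function takes about 1/2.9 of the original time, since 6.63 / 2.28 is roughly 2.9.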

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 6142 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 80.0% |

🌀 Generated Regression Tests and Runtime
from unstructured.partition.html.transformations import is_text_element


# --- Mocking minimal ontology classes and enums for testing ---
# These are minimal stand-ins for the real ontology classes/enums
class ElementTypeEnum:
    metadata = "metadata"
    other = "other"


class OntologyElement:
    def __init__(self, elementType=None):
        self.elementType = elementType


# Text classes (should return True)
class NarrativeText(OntologyElement):
    pass


class Quote(OntologyElement):
    pass


class Paragraph(OntologyElement):
    pass


class Footnote(OntologyElement):
    pass


class FootnoteReference(OntologyElement):
    pass


class Citation(OntologyElement):
    pass


class Bibliography(OntologyElement):
    pass


class Glossary(OntologyElement):
    pass


# Non-text class for negative tests
class Table(OntologyElement):
    pass


class Figure(OntologyElement):
    pass


# Simulate the ontology module
class ontology:
    NarrativeText = NarrativeText
    Quote = Quote
    Paragraph = Paragraph
    Footnote = Footnote
    FootnoteReference = FootnoteReference
    Citation = Citation
    Bibliography = Bibliography
    Glossary = Glossary
    ElementTypeEnum = ElementTypeEnum
    OntologyElement = OntologyElement


# ------------------- UNIT TESTS -------------------


# 1. Basic Test Cases
def test_narrative_text_is_text_element():
    # Should return True for NarrativeText instance
    elem = ontology.NarrativeText()
    codeflash_output = is_text_element(elem)  # 2.58μs -> 1.04μs (148% faster)


def test_quote_is_text_element():
    # Should return True for Quote instance
    elem = ontology.Quote()
    codeflash_output = is_text_element(elem)  # 2.17μs -> 750ns (189% faster)


def test_paragraph_is_text_element():
    # Should return True for Paragraph instance
    elem = ontology.Paragraph()
    codeflash_output = is_text_element(elem)  # 1.96μs -> 667ns (194% faster)


def test_footnote_is_text_element():
    # Should return True for Footnote instance
    elem = ontology.Footnote()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 667ns (181% faster)


def test_footnote_reference_is_text_element():
    # Should return True for FootnoteReference instance
    elem = ontology.FootnoteReference()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_citation_is_text_element():
    # Should return True for Citation instance
    elem = ontology.Citation()
    codeflash_output = is_text_element(elem)  # 1.83μs -> 666ns (175% faster)


def test_bibliography_is_text_element():
    # Should return True for Bibliography instance
    elem = ontology.Bibliography()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_glossary_is_text_element():
    # Should return True for Glossary instance
    elem = ontology.Glossary()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 666ns (182% faster)


def test_metadata_category_is_text_element():
    # Should return True for OntologyElement with elementType == metadata
    elem = ontology.OntologyElement(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_other_category_is_not_text_element():
    # Should return False for OntologyElement with elementType != metadata
    elem = ontology.OntologyElement(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 667ns (169% faster)


def test_table_is_not_text_element():
    # Should return False for Table instance (not in text_classes, not metadata)
    elem = Table()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 750ns (150% faster)


def test_figure_is_not_text_element():
    # Should return False for Figure instance (not in text_classes, not metadata)
    elem = Figure()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 667ns (169% faster)


# 2. Edge Test Cases


def test_elementType_is_none():
    # Should return False if elementType is None
    elem = ontology.OntologyElement(elementType=None)
    codeflash_output = is_text_element(elem)  # 2.25μs -> 917ns (145% faster)


def test_elementType_is_empty_string():
    # Should return False if elementType is empty string
    elem = ontology.OntologyElement(elementType="")
    codeflash_output = is_text_element(elem)  # 2.08μs -> 708ns (194% faster)


def test_text_class_with_other_category():
    # Should return True even if elementType is not metadata, because class is text
    elem = ontology.Paragraph(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_text_class_with_metadata_category_true():
    # Should return True if both class and category are text
    elem = ontology.Paragraph(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_non_text_class_with_metadata_category():
    # Should return True if elementType is metadata even if class is not in text_classes
    elem = Table(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 666ns (175% faster)


def test_non_text_class_with_non_metadata_category():
    # Should return False if neither class nor category are text
    elem = Table(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_elementType_is_integer():
    # Should return False if elementType is an integer
    elem = ontology.OntologyElement(elementType=123)
    codeflash_output = is_text_element(elem)  # 2.62μs -> 875ns (200% faster)


def test_elementType_is_object():
    # Should return False if elementType is an object
    class Dummy:
        pass

    elem = ontology.OntologyElement(elementType=Dummy())
    codeflash_output = is_text_element(elem)  # 2.17μs -> 750ns (189% faster)


def test_elementType_is_list():
    # Should return False if elementType is a list
    elem = ontology.OntologyElement(elementType=["metadata"])
    codeflash_output = is_text_element(elem)  # 1.96μs -> 708ns (177% faster)


# 3. Large Scale Test Cases


def test_many_text_elements():
    # Create 1000 text elements and assert all return True
    elems = [ontology.Paragraph() for _ in range(1000)]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.08ms -> 372μs (190% faster)


def test_many_non_text_elements():
    # Create 1000 non-text elements and assert all return False
    elems = [Table(elementType=ontology.ElementTypeEnum.other) for _ in range(1000)]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.07ms -> 370μs (189% faster)


def test_large_metadata_elements():
    # Create 1000 OntologyElement with elementType=metadata
    elems = [
        ontology.OntologyElement(elementType=ontology.ElementTypeEnum.metadata) for _ in range(1000)
    ]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.07ms -> 367μs (191% faster)


def test_large_non_metadata_elements():
    # Create 1000 OntologyElement with elementType=other
    elems = [
        ontology.OntologyElement(elementType=ontology.ElementTypeEnum.other) for _ in range(1000)
    ]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.06ms -> 366μs (190% faster)


# --- Second generated regression test module ---
from unstructured.partition.html.transformations import is_text_element

# --- Begin: Minimal mocks for ontology elements and enums ---
# These are minimal implementations to allow the test suite to run standalone.
# In real usage, they would come from unstructured.documents.ontology.


class ElementTypeEnum:
    metadata = "metadata"
    other = "other"
    something_else = "something_else"


class OntologyElement:
    def __init__(self, elementType=None):
        self.elementType = elementType


# Each text class is a subclass of OntologyElement.
class NarrativeText(OntologyElement):
    pass


class Quote(OntologyElement):
    pass


class Paragraph(OntologyElement):
    pass


class Footnote(OntologyElement):
    pass


class FootnoteReference(OntologyElement):
    pass


class Citation(OntologyElement):
    pass


class Bibliography(OntologyElement):
    pass


class Glossary(OntologyElement):
    pass


# ---------------------------
# Unit tests for is_text_element
# ---------------------------

# 1. Basic Test Cases


def test_is_text_element_narrative_text():
    # Should return True for NarrativeText instance
    codeflash_output = is_text_element(NarrativeText())  # 2.50μs -> 917ns (173% faster)


def test_is_text_element_quote():
    # Should return True for Quote instance
    codeflash_output = is_text_element(Quote())  # 2.08μs -> 708ns (194% faster)


def test_is_text_element_paragraph():
    # Should return True for Paragraph instance
    codeflash_output = is_text_element(Paragraph())  # 1.96μs -> 625ns (213% faster)


def test_is_text_element_footnote():
    # Should return True for Footnote instance
    codeflash_output = is_text_element(Footnote())  # 1.92μs -> 625ns (207% faster)


def test_is_text_element_footnote_reference():
    # Should return True for FootnoteReference instance
    codeflash_output = is_text_element(FootnoteReference())  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_citation():
    # Should return True for Citation instance
    codeflash_output = is_text_element(Citation())  # 1.88μs -> 708ns (165% faster)


def test_is_text_element_bibliography():
    # Should return True for Bibliography instance
    codeflash_output = is_text_element(Bibliography())  # 1.88μs -> 625ns (200% faster)


def test_is_text_element_glossary():
    # Should return True for Glossary instance
    codeflash_output = is_text_element(Glossary())  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_metadata_category():
    # Should return True for OntologyElement with elementType == metadata
    elem = OntologyElement(elementType=ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_is_text_element_other_category():
    # Should return False for OntologyElement with elementType == other
    elem = OntologyElement(elementType=ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_something_else_category():
    # Should return False for OntologyElement with elementType == something_else
    elem = OntologyElement(elementType=ElementTypeEnum.something_else)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_is_text_element_base_class_no_type():
    # Should return False for bare OntologyElement with no elementType
    elem = OntologyElement()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


# 2. Edge Test Cases


def test_is_text_element_subclass_of_text_class():
    # Should return True for subclass of Paragraph
    class CustomParagraph(Paragraph):
        pass

    elem = CustomParagraph()
    codeflash_output = is_text_element(elem)  # 1.96μs -> 833ns (135% faster)


def test_is_text_element_multiple_inheritance():
    # Should return True if one base class is a text class
    class MixedElement(Paragraph, OntologyElement):
        pass

    elem = MixedElement()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_is_text_element_elementType_is_none():
    # Should return False if elementType is explicitly set to None
    elem = OntologyElement(elementType=None)
    codeflash_output = is_text_element(elem)  # 2.42μs -> 1.00μs (142% faster)


def test_is_text_element_elementType_wrong_type():
    # Should return False if elementType is not a string or doesn't match
    elem = OntologyElement(elementType=123)
    codeflash_output = is_text_element(elem)  # 2.12μs -> 791ns (169% faster)


def test_is_text_element_elementType_case_sensitivity():
    # Should return False if elementType matches but with wrong case
    elem = OntologyElement(elementType="Metadata")
    codeflash_output = is_text_element(elem)  # 2.12μs -> 708ns (200% faster)


def test_is_text_element_elementType_partial_match():
    # Should return False if elementType partially matches 'metadata'
    elem = OntologyElement(elementType="meta")
    codeflash_output = is_text_element(elem)  # 1.96μs -> 666ns (194% faster)


def test_is_text_element_extra_attributes():
    # Should return False for element with extra unrelated attributes
    class WeirdElement(OntologyElement):
        def __init__(self):
            super().__init__(elementType="other")
            self.foo = "bar"

    elem = WeirdElement()
    codeflash_output = is_text_element(elem)  # 2.00μs -> 750ns (167% faster)


def test_is_text_element_text_class_with_metadata_type():
    # Should return True for text class with elementType == metadata
    elem = Paragraph(elementType=ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 666ns (169% faster)


def test_is_text_element_text_class_with_other_type():
    # Should return True for text class with elementType == other (type ignored)
    elem = Paragraph(elementType=ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 625ns (193% faster)


# 3. Large Scale Test Cases


def test_is_text_element_large_batch_text_classes():
    # Should return True for all instances of text classes in a large batch
    elements = [
        Paragraph(),
        Quote(),
        NarrativeText(),
        Footnote(),
        FootnoteReference(),
        Citation(),
        Bibliography(),
        Glossary(),
    ] * 100  # 800 elements
    for elem in elements:
        codeflash_output = is_text_element(elem)  # 860μs -> 296μs (190% faster)


def test_is_text_element_large_batch_non_text():
    # Should return False for non-text elements in a large batch
    elements = [OntologyElement(elementType=ElementTypeEnum.other) for _ in range(500)]
    for elem in elements:
        codeflash_output = is_text_element(elem)  # 534μs -> 185μs (188% faster)


def test_is_text_element_large_batch_mixed():
    # Should correctly identify mixed batch of text and non-text elements
    elements = [
        Paragraph(),
        Quote(),
        OntologyElement(elementType=ElementTypeEnum.metadata),
    ] * 200 + [OntologyElement(elementType=ElementTypeEnum.other)] * 200
    expected = [True, True, True] * 200 + [False] * 200
    for elem, exp in zip(elements, expected):
        codeflash_output = is_text_element(elem)  # 861μs -> 295μs (191% faster)

To edit these changes git checkout codeflash/optimize-is_text_element-mjccgnlf and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 04:03
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 19, 2025