⚡️ Speed up function `is_text_element` by 190% #6

codeflash-ai · 2025-12-19T04:03:26Z

📄 190% (1.90x) speedup for `is_text_element` in `unstructured/partition/html/transformations.py`

⏱️ Runtime : 6.63 milliseconds → 2.28 milliseconds (best of 76 runs)

📝 Explanation and details

The optimization achieves a 189% speedup by eliminating expensive repeated operations and leveraging Python's built-in performance characteristics.

Key optimizations:

Module-level constant definition: Moved text_classes and text_categories to module scope as tuples instead of recreating them on every function call. This eliminates ~70% of the original runtime spent reconstructing these collections (8.7ms → 0ms in line profiler).
Tuple vs List optimization: Changed from lists to tuples, which are more memory-efficient and faster for isinstance() checks since Python can optimize tuple-based type checking.
Eliminated generator expressions: Replaced any(isinstance(...) for ...) with direct isinstance(ontology_element, text_classes), which is significantly faster as it avoids generator overhead and uses Python's optimized C implementation for multiple type checking.
Direct indexing: Replaced any(... for category in text_categories) with direct comparison ontology_element.elementType == text_categories[0] since there's only one category, eliminating loop overhead.

Performance impact: The line profiler shows the critical path went from 52.3% + 15.9% = 68.2% of runtime in generator expressions to 67.5% + 32.5% = 100% in just two optimized operations, but with 7.8x less total time.

Hot path relevance: Based on function_references, this function is called from can_unstructured_elements_be_merged() within a loop processing multiple HTML elements. The optimization will significantly speed up HTML parsing workflows where element classification happens frequently.

Test case performance: All test cases show consistent 140-213% speedups, with the optimization being particularly effective for large-scale processing (800+ elements) showing ~190% improvements, making it ideal for batch document processing scenarios.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 6142 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	80.0%

🌀 Generated Regression Tests and Runtime

from unstructured.partition.html.transformations import is_text_element


# --- Mocking minimal ontology classes and enums for testing ---
# These are minimal stand-ins for the real ontology classes/enums
class ElementTypeEnum:
    metadata = "metadata"
    other = "other"


class OntologyElement:
    def __init__(self, elementType=None):
        self.elementType = elementType


# Text classes (should return True)
class NarrativeText(OntologyElement):
    pass


class Quote(OntologyElement):
    pass


class Paragraph(OntologyElement):
    pass


class Footnote(OntologyElement):
    pass


class FootnoteReference(OntologyElement):
    pass


class Citation(OntologyElement):
    pass


class Bibliography(OntologyElement):
    pass


class Glossary(OntologyElement):
    pass


# Non-text class for negative tests
class Table(OntologyElement):
    pass


class Figure(OntologyElement):
    pass


# Simulate the ontology module
class ontology:
    NarrativeText = NarrativeText
    Quote = Quote
    Paragraph = Paragraph
    Footnote = Footnote
    FootnoteReference = FootnoteReference
    Citation = Citation
    Bibliography = Bibliography
    Glossary = Glossary
    ElementTypeEnum = ElementTypeEnum
    OntologyElement = OntologyElement


# ------------------- UNIT TESTS -------------------


# 1. Basic Test Cases
def test_narrative_text_is_text_element():
    # Should return True for NarrativeText instance
    elem = ontology.NarrativeText()
    codeflash_output = is_text_element(elem)  # 2.58μs -> 1.04μs (148% faster)


def test_quote_is_text_element():
    # Should return True for Quote instance
    elem = ontology.Quote()
    codeflash_output = is_text_element(elem)  # 2.17μs -> 750ns (189% faster)


def test_paragraph_is_text_element():
    # Should return True for Paragraph instance
    elem = ontology.Paragraph()
    codeflash_output = is_text_element(elem)  # 1.96μs -> 667ns (194% faster)


def test_footnote_is_text_element():
    # Should return True for Footnote instance
    elem = ontology.Footnote()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 667ns (181% faster)


def test_footnote_reference_is_text_element():
    # Should return True for FootnoteReference instance
    elem = ontology.FootnoteReference()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_citation_is_text_element():
    # Should return True for Citation instance
    elem = ontology.Citation()
    codeflash_output = is_text_element(elem)  # 1.83μs -> 666ns (175% faster)


def test_bibliography_is_text_element():
    # Should return True for Bibliography instance
    elem = ontology.Bibliography()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_glossary_is_text_element():
    # Should return True for Glossary instance
    elem = ontology.Glossary()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 666ns (182% faster)


def test_metadata_category_is_text_element():
    # Should return True for OntologyElement with elementType == metadata
    elem = ontology.OntologyElement(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_other_category_is_not_text_element():
    # Should return False for OntologyElement with elementType != metadata
    elem = ontology.OntologyElement(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 667ns (169% faster)


def test_table_is_not_text_element():
    # Should return False for Table instance (not in text_classes, not metadata)
    elem = Table()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 750ns (150% faster)


def test_figure_is_not_text_element():
    # Should return False for Figure instance (not in text_classes, not metadata)
    elem = Figure()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 667ns (169% faster)


# 2. Edge Test Cases


def test_elementType_is_none():
    # Should return False if elementType is None
    elem = ontology.OntologyElement(elementType=None)
    codeflash_output = is_text_element(elem)  # 2.25μs -> 917ns (145% faster)


def test_elementType_is_empty_string():
    # Should return False if elementType is empty string
    elem = ontology.OntologyElement(elementType="")
    codeflash_output = is_text_element(elem)  # 2.08μs -> 708ns (194% faster)


def test_text_class_with_metadata_category():
    # Should return True even if elementType is not metadata, because class is text
    elem = ontology.Paragraph(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_text_class_with_metadata_category_true():
    # Should return True if both class and category are text
    elem = ontology.Paragraph(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_non_text_class_with_metadata_category():
    # Should return True if elementType is metadata even if class is not in text_classes
    elem = Table(elementType=ontology.ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 666ns (175% faster)


def test_non_text_class_with_non_metadata_category():
    # Should return False if neither class nor category are text
    elem = Table(elementType=ontology.ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_elementType_is_integer():
    # Should return False if elementType is an integer
    elem = ontology.OntologyElement(elementType=123)
    codeflash_output = is_text_element(elem)  # 2.62μs -> 875ns (200% faster)


def test_elementType_is_object():
    # Should return False if elementType is an object
    class Dummy:
        pass

    elem = ontology.OntologyElement(elementType=Dummy())
    codeflash_output = is_text_element(elem)  # 2.17μs -> 750ns (189% faster)


def test_elementType_is_list():
    # Should return False if elementType is a list
    elem = ontology.OntologyElement(elementType=["metadata"])
    codeflash_output = is_text_element(elem)  # 1.96μs -> 708ns (177% faster)


# 3. Large Scale Test Cases


def test_many_text_elements():
    # Create 1000 text elements and assert all return True
    elems = [ontology.Paragraph() for _ in range(1000)]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.08ms -> 372μs (190% faster)


def test_many_non_text_elements():
    # Create 1000 non-text elements and assert all return False
    elems = [Table(elementType=ontology.ElementTypeEnum.other) for _ in range(1000)]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.07ms -> 370μs (189% faster)


def test_large_metadata_elements():
    # Create 1000 OntologyElement with elementType=metadata
    elems = [
        ontology.OntologyElement(elementType=ontology.ElementTypeEnum.metadata) for _ in range(1000)
    ]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.07ms -> 367μs (191% faster)


def test_large_non_metadata_elements():
    # Create 1000 OntologyElement with elementType=other
    elems = [
        ontology.OntologyElement(elementType=ontology.ElementTypeEnum.other) for _ in range(1000)
    ]
    for elem in elems:
        codeflash_output = is_text_element(elem)  # 1.06ms -> 366μs (190% faster)

from unstructured.partition.html.transformations import is_text_element

# --- Begin: Minimal mocks for ontology elements and enums ---
# These are minimal implementations to allow the test suite to run standalone.
# In real usage, they would come from unstructured.documents.ontology.


class ElementTypeEnum:
    metadata = "metadata"
    other = "other"
    something_else = "something_else"


class OntologyElement:
    def __init__(self, elementType=None):
        self.elementType = elementType


# Each text class is a subclass of OntologyElement.
class NarrativeText(OntologyElement):
    pass


class Quote(OntologyElement):
    pass


class Paragraph(OntologyElement):
    pass


class Footnote(OntologyElement):
    pass


class FootnoteReference(OntologyElement):
    pass


class Citation(OntologyElement):
    pass


class Bibliography(OntologyElement):
    pass


class Glossary(OntologyElement):
    pass


# ---------------------------
# Unit tests for is_text_element
# ---------------------------

# 1. Basic Test Cases


def test_is_text_element_narrative_text():
    # Should return True for NarrativeText instance
    codeflash_output = is_text_element(NarrativeText())  # 2.50μs -> 917ns (173% faster)


def test_is_text_element_quote():
    # Should return True for Quote instance
    codeflash_output = is_text_element(Quote())  # 2.08μs -> 708ns (194% faster)


def test_is_text_element_paragraph():
    # Should return True for Paragraph instance
    codeflash_output = is_text_element(Paragraph())  # 1.96μs -> 625ns (213% faster)


def test_is_text_element_footnote():
    # Should return True for Footnote instance
    codeflash_output = is_text_element(Footnote())  # 1.92μs -> 625ns (207% faster)


def test_is_text_element_footnote_reference():
    # Should return True for FootnoteReference instance
    codeflash_output = is_text_element(FootnoteReference())  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_citation():
    # Should return True for Citation instance
    codeflash_output = is_text_element(Citation())  # 1.88μs -> 708ns (165% faster)


def test_is_text_element_bibliography():
    # Should return True for Bibliography instance
    codeflash_output = is_text_element(Bibliography())  # 1.88μs -> 625ns (200% faster)


def test_is_text_element_glossary():
    # Should return True for Glossary instance
    codeflash_output = is_text_element(Glossary())  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_metadata_category():
    # Should return True for OntologyElement with elementType == metadata
    elem = OntologyElement(elementType=ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.92μs -> 708ns (171% faster)


def test_is_text_element_other_category():
    # Should return False for OntologyElement with elementType == other
    elem = OntologyElement(elementType=ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 625ns (193% faster)


def test_is_text_element_something_else_category():
    # Should return False for OntologyElement with elementType == something_else
    elem = OntologyElement(elementType=ElementTypeEnum.something_else)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


def test_is_text_element_base_class_no_type():
    # Should return False for bare OntologyElement with no elementType
    elem = OntologyElement()
    codeflash_output = is_text_element(elem)  # 1.79μs -> 625ns (187% faster)


# 2. Edge Test Cases


def test_is_text_element_subclass_of_text_class():
    # Should return True for subclass of Paragraph
    class CustomParagraph(Paragraph):
        pass

    elem = CustomParagraph()
    codeflash_output = is_text_element(elem)  # 1.96μs -> 833ns (135% faster)


def test_is_text_element_multiple_inheritance():
    # Should return True if one base class is a text class
    class MixedElement(Paragraph, OntologyElement):
        pass

    elem = MixedElement()
    codeflash_output = is_text_element(elem)  # 1.88μs -> 708ns (165% faster)


def test_is_text_element_elementType_is_none():
    # Should return False if elementType is explicitly set to None
    elem = OntologyElement(elementType=None)
    codeflash_output = is_text_element(elem)  # 2.42μs -> 1.00μs (142% faster)


def test_is_text_element_elementType_wrong_type():
    # Should return False if elementType is not a string or doesn't match
    elem = OntologyElement(elementType=123)
    codeflash_output = is_text_element(elem)  # 2.12μs -> 791ns (169% faster)


def test_is_text_element_elementType_case_sensitivity():
    # Should return False if elementType matches but with wrong case
    elem = OntologyElement(elementType="Metadata")
    codeflash_output = is_text_element(elem)  # 2.12μs -> 708ns (200% faster)


def test_is_text_element_elementType_partial_match():
    # Should return False if elementType partially matches 'metadata'
    elem = OntologyElement(elementType="meta")
    codeflash_output = is_text_element(elem)  # 1.96μs -> 666ns (194% faster)


def test_is_text_element_extra_attributes():
    # Should return False for element with extra unrelated attributes
    class WeirdElement(OntologyElement):
        def __init__(self):
            super().__init__(elementType="other")
            self.foo = "bar"

    elem = WeirdElement()
    codeflash_output = is_text_element(elem)  # 2.00μs -> 750ns (167% faster)


def test_is_text_element_text_class_with_metadata_type():
    # Should return True for text class with elementType == metadata
    elem = Paragraph(elementType=ElementTypeEnum.metadata)
    codeflash_output = is_text_element(elem)  # 1.79μs -> 666ns (169% faster)


def test_is_text_element_text_class_with_other_type():
    # Should return True for text class with elementType == other (type ignored)
    elem = Paragraph(elementType=ElementTypeEnum.other)
    codeflash_output = is_text_element(elem)  # 1.83μs -> 625ns (193% faster)


# 3. Large Scale Test Cases


def test_is_text_element_large_batch_text_classes():
    # Should return True for all instances of text classes in a large batch
    elements = [
        Paragraph(),
        Quote(),
        NarrativeText(),
        Footnote(),
        FootnoteReference(),
        Citation(),
        Bibliography(),
        Glossary(),
    ] * 100  # 800 elements
    for elem in elements:
        codeflash_output = is_text_element(elem)  # 860μs -> 296μs (190% faster)


def test_is_text_element_large_batch_non_text():
    # Should return False for non-text elements in a large batch
    elements = [OntologyElement(elementType=ElementTypeEnum.other) for _ in range(500)]
    for elem in elements:
        codeflash_output = is_text_element(elem)  # 534μs -> 185μs (188% faster)


def test_is_text_element_large_batch_mixed():
    # Should correctly identify mixed batch of text and non-text elements
    elements = [
        Paragraph(),
        Quote(),
        OntologyElement(elementType=ElementTypeEnum.metadata),
    ] * 200 + [OntologyElement(elementType=ElementTypeEnum.other)] * 200
    expected = [True, True, True] * 200 + [False] * 200
    for elem, exp in zip(elements, expected):
        codeflash_output = is_text_element(elem)  # 861μs -> 295μs (191% faster)

To edit these changes git checkout codeflash/optimize-is_text_element-mjccgnlf and push.

The optimization achieves a **189% speedup** by eliminating expensive repeated operations and leveraging Python's built-in performance characteristics. **Key optimizations:** 1. **Module-level constant definition**: Moved `text_classes` and `text_categories` to module scope as tuples instead of recreating them on every function call. This eliminates ~70% of the original runtime spent reconstructing these collections (8.7ms → 0ms in line profiler). 2. **Tuple vs List optimization**: Changed from lists to tuples, which are more memory-efficient and faster for `isinstance()` checks since Python can optimize tuple-based type checking. 3. **Eliminated generator expressions**: Replaced `any(isinstance(...) for ...)` with direct `isinstance(ontology_element, text_classes)`, which is significantly faster as it avoids generator overhead and uses Python's optimized C implementation for multiple type checking. 4. **Direct indexing**: Replaced `any(... for category in text_categories)` with direct comparison `ontology_element.elementType == text_categories[0]` since there's only one category, eliminating loop overhead. **Performance impact**: The line profiler shows the critical path went from 52.3% + 15.9% = 68.2% of runtime in generator expressions to 67.5% + 32.5% = 100% in just two optimized operations, but with 7.8x less total time. **Hot path relevance**: Based on `function_references`, this function is called from `can_unstructured_elements_be_merged()` within a loop processing multiple HTML elements. The optimization will significantly speed up HTML parsing workflows where element classification happens frequently. **Test case performance**: All test cases show consistent 140-213% speedups, with the optimization being particularly effective for large-scale processing (800+ elements) showing ~190% improvements, making it ideal for batch document processing scenarios.

codeflash-ai bot requested a review from aseembits93 December 19, 2025 04:03

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 19, 2025

misrasaurabh1 approved these changes Dec 19, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up function `is_text_element` by 190% #6

⚡️ Speed up function `is_text_element` by 190% #6

Uh oh!

codeflash-ai bot commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

⚡️ Speed up function is_text_element by 190% #6

Are you sure you want to change the base?

⚡️ Speed up function is_text_element by 190% #6

Uh oh!

Conversation

codeflash-ai bot commented Dec 19, 2025

📄 190% (1.90x) speedup for is_text_element in unstructured/partition/html/transformations.py

📝 Explanation and details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

⚡️ Speed up function `is_text_element` by 190% #6

⚡️ Speed up function `is_text_element` by 190% #6

📄 190% (1.90x) speedup for `is_text_element` in `unstructured/partition/html/transformations.py`