@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 27% (0.27x) speedup for _uniquity_file in unstructured/metrics/utils.py

⏱️ Runtime : 8.23 milliseconds → 6.48 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization achieves a 27% speedup by making two key changes to regex handling in _uniquity_file:

What was optimized:

  1. Pre-compiled regex pattern: Changed from re.match(pattern, f) to pattern = re.compile(...); pattern.match(f), so the pattern is compiled once up front and each file skips the per-call lookup in the re module's pattern cache.
  2. Separate filtering and sorting: Split the combined sorted([f for f in file_list if re.match(pattern, f)], key=_sorting_key) into two steps: first filter with a list comprehension, then sort the result.
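The two changes can be sketched as follows. This comment does not show the actual pattern or the body of `_sorting_key`, so `PATTERN_SRC` and the key function below are hypothetical stand-ins; only the before/after shape is taken from the description.

```python
import re

# Hypothetical pattern: matches "file.txt" and "file (N).txt" only.
PATTERN_SRC = r"^file(?: \((\d+)\))?\.txt$"


def _sorting_key(name: str) -> int:
    # Assumed behavior: order by numeric suffix, bare name counts as 0.
    m = re.match(PATTERN_SRC, name)
    return int(m.group(1)) if m and m.group(1) else 0


def before(file_list: list[str]) -> list[str]:
    # Original shape: pattern string passed to re.match for every file,
    # filter and sort written as one expression.
    return sorted([f for f in file_list if re.match(PATTERN_SRC, f)], key=_sorting_key)


def after(file_list: list[str]) -> list[str]:
    # Optimized shape: compile once, filter, then sort as a separate step.
    pattern = re.compile(PATTERN_SRC)
    matching = [f for f in file_list if pattern.match(f)]
    return sorted(matching, key=_sorting_key)
```

Both functions return the same list; only the way the regex is invoked, and where the sort sits, differs.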

Why this is faster:

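  • Per-call regex overhead removed: re.match(pattern, f) does not recompile the pattern on each call (the re module caches compiled patterns), but every call still pays for a cache lookup and an extra layer of function calls. Across thousands of files that adds up; calling match on a pre-compiled pattern skips it.
  • Filtering and sorting measured separately: sorted([...]) already builds the filtered list before sorting, so the split does roughly the same work. Its main effect is that the line profiler reports the filter cost and the sort cost on separate lines.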
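A rough way to check the per-call overhead locally is a best-of-N micro-benchmark. This is a sketch under assumptions: the pattern is a hypothetical stand-in, the file list imitates the large non-matching test case, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

PATTERN_SRC = r"^file(?: \((\d+)\))?\.txt$"  # hypothetical pattern
COMPILED = re.compile(PATTERN_SRC)

# Mostly non-matching names, similar to the large non-matching test case.
FILES = [f"otherfile ({i}).txt" for i in range(1, 1000)] + ["file.txt", "file (1).txt"]


def with_module_function() -> list[str]:
    # Pattern string goes through the re module's cache on every call.
    return [f for f in FILES if re.match(PATTERN_SRC, f)]


def with_compiled_pattern() -> list[str]:
    # Pattern object is used directly, no per-call lookup.
    return [f for f in FILES if COMPILED.match(f)]


def best_of(func, repeat: int = 5, number: int = 20) -> float:
    # Seconds per call, taking the minimum over `repeat` rounds.
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


if __name__ == "__main__":
    print(f"re.match(pattern, f): {best_of(with_module_function) * 1e6:.1f} us")
    print(f"compiled.match(f):    {best_of(with_compiled_pattern) * 1e6:.1f} us")
```

Taking the minimum of several rounds mirrors the "best of 250 runs" reporting used in this comment, since the fastest run is the one least disturbed by other activity on the machine.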

Performance impact analysis:
From the line profiler, the critical line went from 31.9ms (75.3% of total time) to 16.1ms + 2.9ms = 19.0ms across the two new lines (54.8% of total time), a 40% reduction in time on the hottest code path.
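The 40% figure is a reduction in time rather than a speedup ratio. The arithmetic, using only the profiler numbers quoted above, works out as:

```python
before_ms = 31.9        # combined filter+sort line in the original
after_ms = 16.1 + 2.9   # filter line + sort line in the optimized version

# Fraction of the original time that was saved.
reduction = (before_ms - after_ms) / before_ms
print(f"after: {after_ms:.1f} ms, reduction: {reduction:.1%}")
# prints "after: 19.0 ms, reduction: 40.4%"
```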

When this optimization matters most:
Based on the annotated tests, the biggest gains occur with:

  • Large file lists with many non-matching files (280-290% speedup)
  • Lists with 500-1000+ duplicates (15-16% speedup)
  • Mixed scenarios with both matching and non-matching files (28% speedup)
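The speedup percentages in this report are consistent with computing original time divided by optimized time, minus one. That formula is an inference from the published numbers, not something the comment states, and small differences are expected because the displayed times are rounded.

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Speedup in percent, assuming speedup = original / optimized - 1."""
    return (original / optimized - 1) * 100


# Times copied from this report (same unit within each pair).
print(round(speedup_pct(8.23, 6.48), 1))  # overall runtime in ms -> 27.0
print(round(speedup_pct(237, 62.4), 1))   # large non-matching list in us -> 279.8
```

Under this definition a 280% speedup means the optimized code takes roughly a quarter of the original time, which is why it looks so much larger than the 40% time reduction reported for the hottest line.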

Function context impact:
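Since _uniquity_file is called by _get_non_duplicated_filename, which processes directory listings, the gain should grow with directory size: the generated tests in this report show the largest speedups (154-290%) on long lists dominated by non-matching names and about 15-16% on long runs of duplicates. For lists of a handful of files the difference is a few percent either way, and the empty-list case measured 21.5% slower.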

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 2 Passed |
| 🌀 Generated Regression Tests | 60 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| metrics/test_utils.py::test_uniquity_file | 7.92μs | 7.12μs | 11.1%✅ |

🌀 Generated Regression Tests and Runtime
# imports
from unstructured.metrics.utils import _uniquity_file

# unit tests

# 1. Basic Test Cases


def test_no_duplicates():
    # If there are no files with the same name, should return "filename (1).ext"
    files = ["other.txt", "something.doc"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 3.08μs -> 2.92μs (5.69% faster)


def test_one_duplicate():
    # If "file.txt" exists, should suggest "file (1).txt"
    files = ["file.txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 3.79μs -> 3.83μs (1.07% slower)


def test_multiple_duplicates():
    # If "file.txt", "file (1).txt" exist, should suggest "file (2).txt"
    files = ["file.txt", "file (1).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 5.71μs -> 5.50μs (3.78% faster)


def test_gaps_in_numbers():
    # If "file.txt", "file (2).txt" exist, should suggest "file (1).txt"
    files = ["file.txt", "file (2).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 5.50μs -> 5.25μs (4.76% faster)


def test_non_matching_files():
    # Only files with the exact base name and extension should be considered
    files = ["file.txt", "file (1).txt", "file.doc", "file (1).doc", "file1.txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 6.83μs -> 5.83μs (17.1% faster)


def test_highest_number():
    # Should fill the lowest available number
    files = ["file.txt", "file (1).txt", "file (3).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 7.12μs -> 6.67μs (6.87% faster)


def test_sequential_numbers():
    # Should suggest the next number in sequence
    files = ["file.txt", "file (1).txt", "file (2).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 7.04μs -> 6.54μs (7.66% faster)


def test_filename_with_spaces():
    # Handles filenames with spaces
    files = ["my file.txt", "my file (1).txt"]
    codeflash_output = _uniquity_file(files, "my file.txt")  # 6.04μs -> 5.75μs (5.08% faster)


def test_filename_with_multiple_dots():
    # Handles filenames with multiple dots
    files = ["my.file.txt", "my.file (1).txt"]
    codeflash_output = _uniquity_file(files, "my.file.txt")  # 5.83μs -> 5.62μs (3.72% faster)


def test_extension_with_number():
    # Should not confuse extension numbers with base name numbers
    files = ["file.1.txt", "file.1 (1).txt"]
    codeflash_output = _uniquity_file(files, "file.1.txt")  # 6.00μs -> 5.83μs (2.86% faster)


# 2. Edge Test Cases


def test_filename_with_parentheses():
    # Handles filenames that already have parentheses in the base name
    files = ["file (test).txt", "file (test) (1).txt"]
    codeflash_output = _uniquity_file(files, "file (test).txt")  # 6.29μs -> 6.12μs (2.71% faster)


def test_filename_with_number_in_name():
    # Handles filenames with numbers in the base name
    files = ["file2.txt", "file2 (1).txt"]
    codeflash_output = _uniquity_file(files, "file2.txt")  # 6.00μs -> 5.75μs (4.35% faster)


def test_filename_with_brackets_and_number():
    # Handles filenames with both brackets and numbers in the base name
    files = ["file (2023).txt", "file (2023) (1).txt"]
    codeflash_output = _uniquity_file(files, "file (2023).txt")  # 6.88μs -> 6.62μs (3.77% faster)


def test_files_with_similar_names():
    # Only exact matches should be considered
    files = ["file.txt", "file1.txt", "file (1).txt", "file (2).txt", "file (1).doc"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 8.00μs -> 7.12μs (12.3% faster)


def test_files_with_large_numbers():
    # Should handle large numbers in parentheses
    files = ["file.txt", "file (999).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 5.71μs -> 5.50μs (3.78% faster)


def test_files_with_double_digit_gaps():
    # Should fill the lowest available number
    files = ["file.txt", "file (1).txt", "file (10).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 7.21μs -> 6.75μs (6.79% faster)


def test_files_with_zero_in_parentheses():
    # Should ignore (0) as it's not a valid duplicate number (should start at 1)
    files = ["file.txt", "file (0).txt", "file (1).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 7.00μs -> 6.58μs (6.33% faster)


def test_files_with_non_integer_brackets():
    # Should ignore files with non-integer brackets
    files = ["file.txt", "file (one).txt", "file (1).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 5.96μs -> 5.50μs (8.35% faster)


def test_files_with_partial_match():
    # Should not match files like "file (1).txtx"
    files = ["file.txt", "file (1).txtx"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 4.38μs -> 4.08μs (7.13% faster)


def test_files_with_leading_trailing_spaces():
    # Should not match files with extra spaces
    files = [" file.txt", "file.txt ", "file.txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 4.50μs -> 4.08μs (10.2% faster)


def test_files_with_unicode_characters():
    # Handles unicode in filenames
    files = ["fïlè.txt", "fïlè (1).txt"]
    codeflash_output = _uniquity_file(files, "fïlè.txt")  # 6.21μs -> 6.00μs (3.48% faster)


def test_files_with_extension_only():
    # Handles files with only extension
    files = [".gitignore", ".gitignore (1)"]
    codeflash_output = _uniquity_file(files, ".gitignore")  # 5.54μs -> 5.46μs (1.52% faster)


def test_files_with_dotfile_and_extension():
    # Handles dotfiles with extension
    files = [".env", ".env (1)"]
    codeflash_output = _uniquity_file(files, ".env")  # 4.58μs -> 4.38μs (4.75% faster)


def test_files_with_multiple_extensions():
    # Handles files like "archive.tar.gz"
    files = ["archive.tar.gz", "archive.tar (1).gz"]
    codeflash_output = _uniquity_file(files, "archive.tar.gz")  # 7.12μs -> 6.92μs (3.01% faster)


def test_files_with_unusual_extension():
    # Handles files with unusual extensions
    files = ["file.weirdext", "file (1).weirdext"]
    codeflash_output = _uniquity_file(files, "file.weirdext")  # 6.42μs -> 6.12μs (4.77% faster)


# 3. Large Scale Test Cases


def test_large_number_of_duplicates():
    # Handles a large number of duplicates (up to 1000)
    files = ["file.txt"] + [f"file ({i}).txt" for i in range(1, 1000)]
    codeflash_output = _uniquity_file(files, "file.txt")  # 1.32ms -> 1.14ms (16.0% faster)


def test_large_gap_in_large_file_list():
    # Handles a large list with a gap in the sequence
    files = ["file.txt"] + [f"file ({i}).txt" for i in range(1, 500)] + ["file (501).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 661μs -> 570μs (16.0% faster)


def test_large_file_list_with_non_matching_files():
    # Handles large list with many non-matching files
    files = [f"otherfile ({i}).txt" for i in range(1, 1000)] + ["file.txt", "file (1).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 237μs -> 62.4μs (280% faster)


def test_large_file_list_with_gaps():
    # Handles large list with a single gap in the sequence (123 is missing)
    files = ["file.txt"] + [f"file ({i}).txt" for i in range(1, 1000) if i != 123]
    codeflash_output = _uniquity_file(files, "file.txt")  # 1.29ms -> 1.11ms (16.6% faster)


def test_large_file_list_with_similar_names():
    # Handles large list with similar but not matching names
    files = [f"file_{i}.txt" for i in range(1000)] + ["file.txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 307μs -> 120μs (154% faster)


# 4. Additional Robustness Cases


def test_files_with_mixed_case():
    # Should be case sensitive
    files = ["File.txt", "file.txt", "file (1).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 6.00μs -> 5.54μs (8.28% faster)


def test_files_with_special_characters():
    # Handles special characters in the filename
    files = ["f!l@e#.txt", "f!l@e# (1).txt"]
    codeflash_output = _uniquity_file(files, "f!l@e#.txt")  # 6.08μs -> 5.79μs (5.04% faster)


def test_files_with_long_extension():
    # Handles long extensions
    files = ["file.longextension", "file (1).longextension"]
    codeflash_output = _uniquity_file(
        files, "file.longextension"
    )  # 6.29μs -> 6.17μs (2.03% faster)


def test_files_with_multiple_gaps():
    # Should fill the lowest available number
    files = ["file.txt", "file (2).txt", "file (4).txt"]
    codeflash_output = _uniquity_file(files, "file.txt")  # 7.04μs -> 6.62μs (6.29% faster)


def test_files_with_non_ascii():
    # Handles non-ASCII characters
    files = ["файл.txt", "файл (1).txt"]
    codeflash_output = _uniquity_file(files, "файл.txt")  # 7.46μs -> 7.04μs (5.91% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import pytest

from unstructured.metrics.utils import _uniquity_file

# unit tests

# BASIC TEST CASES


def test_no_duplicates_returns_first_suffix():
    # No duplicates in the list, should return "filename (1).ext"
    file_list = []
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 2.12μs -> 2.71μs (21.5% slower)


def test_one_duplicate_returns_next_suffix():
    # One duplicate, should return "filename (1).ext"
    file_list = ["filename.txt"]
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 4.17μs -> 4.17μs (0.000% faster)


def test_multiple_duplicates_returns_next_available_suffix():
    # Several duplicates, should return "filename (3).ext"
    file_list = ["filename.txt", "filename (1).txt", "filename (2).txt"]
    target = "filename.txt"
    expected = "filename (3).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 7.71μs -> 7.17μs (7.56% faster)


def test_skipped_suffix_fills_gap():
    # Should fill the missing (2)
    file_list = ["filename.txt", "filename (1).txt", "filename (3).txt"]
    target = "filename.txt"
    expected = "filename (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 7.50μs -> 7.00μs (7.14% faster)


def test_non_matching_files_are_ignored():
    # Only matching pattern files are considered
    file_list = [
        "filename.txt",
        "otherfile.txt",
        "filename (1).txt",
        "filename (2).txt",
        "filename (1).doc",
    ]
    target = "filename.txt"
    expected = "filename (3).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 8.08μs -> 7.38μs (9.60% faster)


def test_different_extension_is_ignored():
    # Only files with the same extension are considered
    file_list = ["filename.txt", "filename (1).txt", "filename (2).doc"]
    target = "filename.txt"
    expected = "filename (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.21μs -> 5.83μs (6.43% faster)


def test_filename_with_spaces():
    # Filenames with spaces should be handled
    file_list = ["my file.txt", "my file (1).txt"]
    target = "my file.txt"
    expected = "my file (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.00μs -> 5.75μs (4.35% faster)


def test_filename_with_multiple_periods():
    # Filenames with multiple periods should be handled
    file_list = ["my.file.name.txt", "my.file.name (1).txt"]
    target = "my.file.name.txt"
    expected = "my.file.name (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.29μs -> 6.12μs (2.73% faster)


def test_filename_with_numbers_in_name():
    # Numbers in the filename (not in suffix) should be ignored for suffix logic
    file_list = ["file123.txt", "file123 (1).txt"]
    target = "file123.txt"
    expected = "file123 (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.25μs -> 6.08μs (2.75% faster)


def test_filename_with_parentheses_in_name():
    # Parentheses in the filename (not as suffix) should be handled
    file_list = ["file (draft).txt", "file (draft) (1).txt"]
    target = "file (draft).txt"
    expected = "file (draft) (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.42μs -> 6.21μs (3.37% faster)


# EDGE TEST CASES


def test_suffix_gap_at_start():
    # If (1) is missing, should fill it
    file_list = ["filename.txt", "filename (2).txt", "filename (3).txt"]
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 7.33μs -> 6.96μs (5.39% faster)


def test_suffix_gap_in_middle():
    # If (2) is missing, should fill it
    file_list = ["filename.txt", "filename (1).txt", "filename (3).txt", "filename (4).txt"]
    target = "filename.txt"
    expected = "filename (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 8.83μs -> 8.17μs (8.15% faster)


def test_suffix_gap_at_end():
    # If (4) is missing, should fill it
    file_list = [
        "filename.txt",
        "filename (1).txt",
        "filename (2).txt",
        "filename (3).txt",
        "filename (5).txt",
    ]
    target = "filename.txt"
    expected = "filename (4).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 10.2μs -> 9.29μs (9.42% faster)


def test_large_suffix_number():
    # Should fill the lowest missing number, not just max+1
    file_list = ["filename.txt", "filename (1).txt", "filename (2).txt", "filename (100).txt"]
    target = "filename.txt"
    expected = "filename (3).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 8.83μs -> 8.21μs (7.60% faster)


def test_non_integer_suffix_ignored():
    # Non-integer suffixes should be ignored
    file_list = ["filename.txt", "filename (one).txt", "filename (1).txt"]
    target = "filename.txt"
    expected = "filename (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.12μs -> 5.75μs (6.52% faster)


def test_files_with_similar_names():
    # Only exact matches should be considered
    file_list = ["filename.txt", "filename (1).txt", "filename1.txt", "filename (1).doc"]
    target = "filename.txt"
    expected = "filename (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.62μs -> 5.92μs (12.0% faster)


def test_filename_with_unicode_characters():
    # Unicode in filenames should be handled
    file_list = ["fílênâmé.txt", "fílênâmé (1).txt"]
    target = "fílênâmé.txt"
    expected = "fílênâmé (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.54μs -> 6.21μs (5.35% faster)


def test_filename_with_no_extension():
    # A target without an extension is not supported
    file_list = ["filename", "filename (1)"]
    target = "filename"
    # Should raise ValueError because rsplit(".", 1) yields a single item and cannot be split into name and extension
    with pytest.raises(ValueError):
        _uniquity_file(file_list, target)  # 1.00μs -> 1.00μs (0.000% faster)


def test_target_filename_with_multiple_dots():
    # Only the last dot is treated as extension
    file_list = ["my.file.name.txt", "my.file.name (1).txt"]
    target = "my.file.name.txt"
    expected = "my.file.name (2).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 6.46μs -> 6.21μs (4.01% faster)


def test_files_with_extra_spaces():
    # Extra spaces in filename should be treated as part of the name
    file_list = ["filename.txt", "filename  (1).txt"]
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 4.38μs -> 4.17μs (5.02% faster)


# LARGE SCALE TEST CASES


def test_large_number_of_duplicates():
    # Should return the next available suffix after the largest
    file_list = ["filename.txt"] + [f"filename ({i}).txt" for i in range(1, 1000)]
    target = "filename.txt"
    expected = "filename (1000).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 1.37ms -> 1.19ms (15.1% faster)


def test_large_number_of_files_with_gap():
    # Should fill the first missing number
    file_list = (
        ["filename.txt"]
        + [f"filename ({i}).txt" for i in range(1, 500)]
        + [f"filename ({i}).txt" for i in range(501, 1001)]
    )
    target = "filename.txt"
    expected = "filename (500).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 1.35ms -> 1.18ms (15.1% faster)


def test_large_number_of_non_matching_files():
    # Should ignore non-matching files efficiently
    file_list = [f"otherfile ({i}).txt" for i in range(1, 1000)] + ["filename.txt"]
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 237μs -> 60.8μs (290% faster)


def test_large_number_of_similar_but_not_matching_files():
    # Should ignore files with similar but not exact names
    file_list = [f"filename{i}.txt" for i in range(1, 1000)] + ["filename.txt"]
    target = "filename.txt"
    expected = "filename (1).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 328μs -> 128μs (155% faster)


def test_performance_with_mixed_large_files():
    # Should handle large mixed file lists efficiently
    file_list = (
        [f"filename ({i}).txt" for i in range(1, 500)]
        + [f"otherfile ({i}).txt" for i in range(1, 500)]
        + ["filename.txt"]
    )
    target = "filename.txt"
    expected = "filename (500).txt"
    codeflash_output = _uniquity_file(file_list, target)  # 803μs -> 625μs (28.6% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
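
The behavior these tests describe (ignore non-matching names, fill the lowest free suffix, raise on a target without an extension) can be sketched in a few lines. `next_free_name` is a hypothetical stand-in inferred from the test comments and `expected` values above; it is not the code in `unstructured/metrics/utils.py`, and its pattern is an assumption.

```python
import re


def next_free_name(file_list: list[str], target: str) -> str:
    """Hypothetical sketch of the behavior the tests describe."""
    # Unpacking raises ValueError when the target contains no ".".
    stem, ext = target.rsplit(".", 1)
    # Compile once; matches "stem.ext" and "stem (N).ext" exactly.
    pattern = re.compile(rf"^{re.escape(stem)}(?: \((\d+)\))?\.{re.escape(ext)}$")

    used = set()
    for name in file_list:
        m = pattern.match(name)
        if m and m.group(1) is not None:
            used.add(int(m.group(1)))

    # Walk upward from 1 until a suffix is free, which fills gaps first.
    counter = 1
    while counter in used:
        counter += 1
    return f"{stem} ({counter}).{ext}"
```

For example, `next_free_name(["filename.txt", "filename (1).txt", "filename (3).txt"], "filename.txt")` returns `"filename (2).txt"`, matching the gap-filling case above.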

To edit these changes git checkout codeflash/optimize-_uniquity_file-mjckqvak and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 07:55
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 19, 2025