@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 21% (0.21x) speedup for _format_grouping_output in unstructured/metrics/utils.py

⏱️ Runtime : 5.81 milliseconds → 4.80 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization improves performance by avoiding the expensive reset_index() operation in the common case where DataFrames have default RangeIndex structures.

Key optimizations applied:

  1. Fast path detection: Checks if all DataFrames have default RangeIndex (starting from 0, step 1) which allows skipping the reset_index() overhead
  2. Manual index insertion: Instead of calling reset_index(), manually inserts an 'index' column using result.insert(0, 'index', range(len(result))), which is significantly faster
  3. Graceful fallback: Uses try/except to fall back to the original reset_index() behavior for edge cases
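
The three steps above can be sketched as follows. This is a minimal illustration and not the actual diff from this PR: the names `_is_default_range_index` and `format_grouping_output_sketch` are hypothetical, and the sketch assumes the original function is equivalent to `pd.concat(dfs, axis=1).reset_index()`.

```python
import pandas as pd


def _is_default_range_index(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True for an unnamed RangeIndex with start 0 and step 1."""
    idx = df.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.start == 0
        and idx.step == 1
        and idx.name is None
    )


def format_grouping_output_sketch(*dfs: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise and expose the row labels as a leading column."""
    # Errors from concat (no inputs, non-DataFrame inputs) propagate unchanged.
    result = pd.concat(dfs, axis=1)
    try:
        if all(_is_default_range_index(df) for df in dfs):
            # Fast path: the labels are 0..n-1, so write them directly.
            result.insert(0, "index", range(len(result)))
            return result
    except ValueError:
        # insert() refuses to add a column named "index" if one already exists;
        # reset_index() handles that case itself, so fall through to it.
        pass
    # Fallback: custom, named, or MultiIndex labels take the original route.
    return result.reset_index()
```

Keeping `pd.concat` outside the `try` block is a deliberate choice in this sketch, so that the error cases raise exactly what the unoptimized code would raise.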

Why this leads to speedup:

  • reset_index() creates a new DataFrame and copies all data, while manual index insertion only adds one column
  • The optimization path avoids pandas' internal index reconstruction logic
  • Range generation is faster than DataFrame reconstruction
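
To check this reasoning locally, a small timing comparison along the following lines may help. It is only a sketch: the frame sizes and run count are arbitrary choices, and the measured difference will depend on the pandas version and machine, so no expected numbers are given.

```python
import timeit

import pandas as pd

a = pd.DataFrame({"a": range(1000)})
b = pd.DataFrame({"b": range(1000, 2000)})


def with_reset_index():
    # Original approach: concatenate, then let pandas rebuild the index.
    return pd.concat([a, b], axis=1).reset_index()


def with_insert():
    # Fast-path approach: concatenate, then add a single "index" column.
    out = pd.concat([a, b], axis=1)
    out.insert(0, "index", range(len(out)))
    return out


runs = 200
t_reset = timeit.timeit(with_reset_index, number=runs) / runs
t_insert = timeit.timeit(with_insert, number=runs) / runs
print(f"reset_index: {t_reset * 1e6:.1f} µs per call")
print(f"insert:      {t_insert * 1e6:.1f} µs per call")
```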

Impact on workloads:

Based on function_references, this function is called from get_mean_grouping() in a metrics evaluation pipeline. The 21% speedup will be particularly beneficial when:

  • Processing multiple metric aggregations (the function loops through agg_fields)
  • Working with large datasets in evaluation workflows
  • Running batch evaluations where this function is called repeatedly

Test case performance:

The optimization excels with:

  • Simple DataFrames with default indexes: about 28-32% faster when concatenating two or three DataFrames, and about 14-17% faster for a single DataFrame
  • Large datasets: 31.9% improvement with 1000 rows, 29% with many DataFrames
  • Mixed data types: 47.4% speedup maintained even with different column types

Preserved compatibility:

  • DataFrames with custom indexes fall back to the original behavior (at most about 2% slower in the tests below)
  • Error cases and edge conditions raise the same exceptions as before
  • All data types and NaN values are handled correctly; custom-index and MultiIndex cases go through the fallback path
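
One way to spot-check these compatibility claims is to compare a candidate implementation against the unoptimized behavior. The sketch below assumes that behavior is `pd.concat(dfs, axis=1).reset_index()`; `reference` and `assert_matches_reference` are hypothetical helper names, not part of the repository.

```python
import pandas as pd


def reference(*dfs):
    # Assumed pre-optimization behavior of the function under test.
    return pd.concat(dfs, axis=1).reset_index()


def assert_matches_reference(candidate, *dfs):
    # Raises AssertionError if values, dtypes, or column labels differ.
    pd.testing.assert_frame_equal(candidate(*dfs), reference(*dfs))


custom = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
multi_index = pd.MultiIndex.from_arrays(
    [["bar", "baz"], ["one", "two"]], names=("first", "second")
)
multi = pd.DataFrame({"a": [1, 2]}, index=multi_index)
with_nan = pd.DataFrame({"a": [1, None]})

# Custom labels are moved into the "index" column.
print(reference(custom)["index"].tolist())  # ['x', 'y']
# Named MultiIndex levels become ordinary columns.
print(list(reference(multi).columns))  # ['first', 'second', 'a']

# Replace `reference` with the function under test to compare the two.
for frames in [(custom,), (multi,), (with_nan,)]:
    assert_matches_reference(reference, *frames)
```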

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 30 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import pandas as pd

# imports
import pytest  # used for our unit tests

from unstructured.metrics.utils import _format_grouping_output

# unit tests

# ---------------- BASIC TEST CASES ----------------


def test_single_dataframe():
    # Test with a single DataFrame input
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 147μs -> 128μs (14.4% faster)
    # Should add a new index column at the start
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_two_dataframes_same_length():
    # Test with two DataFrames of the same length
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 179μs -> 137μs (30.6% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_three_dataframes():
    # Test with three DataFrames
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    df3 = pd.DataFrame({"c": [5, 6]})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 186μs -> 142μs (31.0% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4], "c": [5, 6]})


def test_column_name_conflict():
    # Test with two DataFrames having the same column name
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"a": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 174μs -> 136μs (28.0% faster)
    # concat keeps both columns under the same name "a". A dict literal cannot
    # hold duplicate keys, so build the expected frame from rows instead.
    expected = pd.DataFrame([[0, 1, 3], [1, 2, 4]], columns=["index", "a", "a"])


def test_empty_dataframe():
    # Test with an empty DataFrame
    df1 = pd.DataFrame({"a": []})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 127μs -> 111μs (14.7% faster)
    expected = pd.DataFrame({"index": [], "a": []})


def test_dataframe_with_index():
    # Test DataFrame with a custom index
    df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 134μs -> 137μs (2.28% slower)
    # reset_index() moves the custom labels into the "index" column
    expected = pd.DataFrame({"index": ["x", "y"], "a": [1, 2]})


# ---------------- EDGE TEST CASES ----------------


def test_no_arguments():
    # Test when no DataFrames are passed (should raise ValueError)
    with pytest.raises(ValueError):
        _format_grouping_output()  # 3.71μs -> 3.50μs (5.94% faster)


def test_non_dataframe_argument():
    # Test with a non-DataFrame argument (should raise TypeError)
    df1 = pd.DataFrame({"a": [1, 2]})
    not_a_df = [1, 2]
    with pytest.raises(TypeError):
        _format_grouping_output(df1, not_a_df)  # 6.71μs -> 8.08μs (17.0% slower)


def test_dataframe_with_multiindex():
    # Test with DataFrame with MultiIndex
    arrays = [["bar", "baz"], ["one", "two"]]
    index = pd.MultiIndex.from_arrays(arrays, names=("first", "second"))
    df1 = pd.DataFrame({"a": [1, 2]}, index=index)
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 269μs -> 266μs (1.11% faster)
    # reset_index() turns the named levels into "first" and "second" columns
    expected = df1.reset_index()


def test_dataframe_with_nan_values():
    # Test with DataFrame containing NaN values
    df1 = pd.DataFrame({"a": [1, None]})
    df2 = pd.DataFrame({"b": [None, 2]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 177μs -> 135μs (31.5% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, None], "b": [None, 2]})


def test_dataframe_with_object_columns():
    # Test with object dtype columns (strings)
    df1 = pd.DataFrame({"a": ["foo", "bar"]})
    df2 = pd.DataFrame({"b": ["baz", "qux"]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 166μs -> 129μs (28.3% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": ["foo", "bar"], "b": ["baz", "qux"]})


def test_dataframe_with_all_none():
    # Test with DataFrame where all values are None
    df1 = pd.DataFrame({"a": [None, None]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 127μs -> 110μs (14.8% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [None, None]})


# ---------------- LARGE SCALE TEST CASES ----------------


def test_large_number_of_rows():
    # Test with a large number of rows (1000)
    df1 = pd.DataFrame({"a": list(range(1000))})
    df2 = pd.DataFrame({"b": list(range(1000, 2000))})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 169μs -> 129μs (31.3% faster)


def test_large_number_of_columns():
    # Test with a large number of DataFrames (columns)
    dfs = [pd.DataFrame({f"col{i}": [i, i + 1]}) for i in range(100)]
    codeflash_output = _format_grouping_output(*dfs)
    result = codeflash_output  # 1.38ms -> 1.10ms (26.1% faster)


def test_large_mixed_types():
    # Test with large DataFrames with mixed types
    df1 = pd.DataFrame({"a": list(range(1000))})
    df2 = pd.DataFrame({"b": ["x"] * 1000})
    df3 = pd.DataFrame({"c": [None] * 1000})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 230μs -> 156μs (47.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

# imports
import pytest  # used for our unit tests

from unstructured.metrics.utils import _format_grouping_output

# unit tests

# 1. Basic Test Cases


def test_single_dataframe():
    # Test with a single DataFrame
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 135μs -> 115μs (16.8% faster)
    # Should add an 'index' column and keep all data
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_two_dataframes_same_length():
    # Test with two DataFrames of same length
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 168μs -> 129μs (30.4% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_three_dataframes():
    # Test with three DataFrames
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    df3 = pd.DataFrame({"c": [5, 6]})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 180μs -> 136μs (31.7% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4], "c": [5, 6]})


def test_column_name_collision():
    # Test with DataFrames that have overlapping column names
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 174μs -> 135μs (29.1% faster)
    # concat does not add suffixes; both "a" columns keep the same name.
    # A dict literal cannot hold duplicate keys, so build the frame from rows.
    expected = pd.DataFrame(
        [[0, 1, 3, 5, 7], [1, 2, 4, 6, 8]], columns=["index", "a", "b", "a", "c"]
    )


def test_empty_dataframes():
    # Test with empty DataFrames (0 rows, but with columns)
    df1 = pd.DataFrame({"a": []})
    df2 = pd.DataFrame({"b": []})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 163μs -> 126μs (29.7% faster)
    expected = pd.DataFrame({"index": [], "a": [], "b": []})


def test_dataframe_with_index():
    # DataFrame with a custom index
    df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 134μs -> 135μs (0.861% slower)
    expected = pd.DataFrame({"index": ["x", "y"], "a": [1, 2]})


# 2. Edge Test Cases


def test_no_arguments():
    # Should raise ValueError if no DataFrames are passed
    with pytest.raises(ValueError):
        _format_grouping_output()  # 3.58μs -> 3.38μs (6.19% faster)


def test_non_dataframe_argument():
    # Passing a non-DataFrame should raise an error
    df1 = pd.DataFrame({"a": [1, 2]})
    not_a_df = [1, 2]
    with pytest.raises(TypeError):
        _format_grouping_output(df1, not_a_df)  # 6.83μs -> 8.08μs (15.5% slower)


def test_dataframe_with_no_columns():
    # DataFrame with no columns but with rows
    df1 = pd.DataFrame(index=[0, 1])
    df2 = pd.DataFrame({"a": [1, 2]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 190μs -> 190μs (0.175% slower)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2]})


def test_dataframe_with_multiindex():
    # DataFrame with MultiIndex
    idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["x", "y"])
    df1 = pd.DataFrame({"val": [10, 20]}, index=idx)
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 258μs -> 258μs (0.113% faster)
    expected = df1.reset_index()


def test_dataframe_with_duplicate_columns():
    # DataFrame with duplicate column names
    df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 136μs -> 118μs (14.6% faster)
    expected = df1.reset_index()


def test_dataframe_with_nan_values():
    # DataFrame with NaN values
    df1 = pd.DataFrame({"a": [1, None, 3]})
    df2 = pd.DataFrame({"b": [4, 5, None]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 170μs -> 129μs (31.6% faster)
    expected = pd.DataFrame({"index": [0, 1, 2], "a": [1, None, 3], "b": [4, 5, None]})


# 3. Large Scale Test Cases


def test_large_number_of_rows():
    # Test with a large DataFrame (1000 rows)
    n = 1000
    df1 = pd.DataFrame({"a": range(n)})
    df2 = pd.DataFrame({"b": range(n, 2 * n)})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 169μs -> 128μs (31.9% faster)
    expected = pd.DataFrame({"index": range(n), "a": range(n), "b": range(n, 2 * n)})


def test_many_dataframes():
    # Test with many DataFrames (10 DataFrames of 100 rows)
    n = 100
    dfs = [pd.DataFrame({f"col{i}": range(n)}) for i in range(10)]
    codeflash_output = _format_grouping_output(*dfs)
    result = codeflash_output  # 272μs -> 211μs (29.0% faster)
    # Build expected DataFrame
    expected_dict = {"index": range(n)}
    for i in range(10):
        expected_dict[f"col{i}"] = range(n)
    expected = pd.DataFrame(expected_dict)


def test_large_number_of_columns():
    # Test with a DataFrame with many columns (500 columns)
    n = 10
    cols = {f"col{i}": range(n) for i in range(500)}
    df = pd.DataFrame(cols)
    codeflash_output = _format_grouping_output(df)
    result = codeflash_output  # 159μs -> 137μs (16.8% faster)
    expected = df.reset_index()


import pytest

from unstructured.metrics.utils import _format_grouping_output


def test__format_grouping_output():
    with pytest.raises(ValueError, match="No\\ objects\\ to\\ concatenate"):
        _format_grouping_output()

🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_e8goshnj/tmprnu9hz9r/test_concolic_coverage.py::test__format_grouping_output | 3.92μs | 3.92μs | 0.000% ✅ |

To edit these changes, run git checkout codeflash/optimize-_format_grouping_output-mjck9hzq, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 07:41
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Dec 19, 2025