⚡️ Speed up function `_format_grouping_output` by 21% #17

codeflash-ai · 2025-12-19T07:41:49Z

📄 21% (0.21x) speedup for `_format_grouping_output` in `unstructured/metrics/utils.py`

⏱️ Runtime: 5.81 milliseconds → 4.80 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization improves performance by **avoiding the expensive `reset_index()` operation** in the common case where the input DataFrames have a default `RangeIndex`.

**Key optimizations applied:**

1. **Fast path detection**: Checks whether every input DataFrame has a default `RangeIndex` (starting from 0, step 1), in which case the `reset_index()` overhead can be skipped.
2. **Manual index insertion**: Instead of calling `reset_index()`, inserts an `index` column with `result.insert(0, 'index', range(len(result)))`, which is faster in the measurements reported in this PR.
3. **Graceful fallback**: Uses `try`/`except` to fall back to the original behavior for edge cases.
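The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the diff from this PR: the names `_is_default_range_index` and `format_grouping_output_sketch` are hypothetical, and the extra guards (unnamed index, no existing `index` column) are assumptions added so the fast path matches `reset_index()` output.

```python
import pandas as pd


def _is_default_range_index(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True for an unnamed RangeIndex with start 0, step 1."""
    idx = df.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.start == 0
        and idx.step == 1
        and idx.name is None
    )


def format_grouping_output_sketch(*dfs: pd.DataFrame) -> pd.DataFrame:
    """Concatenate frames side by side and expose the row index as a column."""
    # pd.concat raises ValueError for no inputs and TypeError for non-pandas
    # inputs, so error behavior is the same as in the reset_index() version.
    result = pd.concat(dfs, axis=1)
    try:
        fast_path_ok = (
            all(_is_default_range_index(df) for df in dfs)
            and "index" not in result.columns  # reset_index() would pick another name
        )
        if fast_path_ok:
            # Fast path: add a 0..n-1 column in place instead of rebuilding the frame.
            result.insert(0, "index", range(len(result)))
            return result
    except Exception:
        # Assumed fallback policy: any surprise reverts to the original behavior.
        pass
    return result.reset_index()


a = pd.DataFrame({"a": [1, 2]})
b = pd.DataFrame({"b": [3, 4]})
print(format_grouping_output_sketch(a, b))
```

The fallback reuses the already concatenated `result`, so the slow path costs only the extra index checks compared with the original one-liner.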

**Why this leads to speedup:**

- `reset_index()` creates a new DataFrame and copies all data, while manual index insertion only adds one column
- The optimization path avoids pandas' internal index reconstruction logic
- Range generation is faster than DataFrame reconstruction
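A small micro-benchmark along these lines can be used to check the claim locally. It is a sketch under assumptions: the frame sizes, run counts, and function names are arbitrary choices for illustration, and the absolute timings will vary with the machine and pandas version.

```python
import timeit

import pandas as pd

# Three single-column frames with default RangeIndex (sizes chosen arbitrarily).
dfs = [pd.DataFrame({f"col{i}": range(1000)}) for i in range(3)]


def with_reset_index() -> pd.DataFrame:
    # Original approach: concatenate, then rebuild the frame with the index as a column.
    return pd.concat(dfs, axis=1).reset_index()


def with_insert() -> pd.DataFrame:
    # Fast-path approach: concatenate, then add the 0..n-1 column in place.
    result = pd.concat(dfs, axis=1)
    result.insert(0, "index", range(len(result)))
    return result


# Both approaches should yield the same frame before timings are compared.
pd.testing.assert_frame_equal(with_reset_index(), with_insert())

runs = 200
t_reset = min(timeit.repeat(with_reset_index, number=runs, repeat=5)) / runs
t_insert = min(timeit.repeat(with_insert, number=runs, repeat=5)) / runs
print(f"reset_index: {t_reset * 1e6:.1f} µs per call, insert: {t_insert * 1e6:.1f} µs per call")
```

Each timed call builds a fresh frame through `pd.concat`, which matters because `insert` mutates in place and would fail on a frame that already has an `index` column.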

**Impact on workloads:**

Based on function_references, this function is called from `get_mean_grouping()` in a metrics evaluation pipeline. The 21% speedup will be particularly beneficial when:

- Processing multiple metric aggregations (the function loops through `agg_fields`)
- Working with large datasets in evaluation workflows
- Running batch evaluations where this function is called repeatedly

**Test case performance:**

The optimization excels with:

- **Simple DataFrames with default indexes**: roughly 28-32% speedup when concatenating two or three DataFrames, and 14-17% for a single DataFrame
- **Large datasets**: 31.9% improvement with 1000 rows, 29% with many DataFrames
- **Mixed data types**: 47.4% speedup on three 1000-row DataFrames with different column types

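**Preserved compatibility:**

- DataFrames with custom indexes fall back to the original behavior, at a small cost (about 1-2% slower in the generated tests)
- Error cases raise the same exceptions as before (`ValueError` for no arguments, `TypeError` for non-DataFrame arguments)
- NaN values and mixed data types are handled on the fast path when indexes are default, while custom-index and MultiIndex inputs are handled through the fallback path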
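For reference, the "original behavior" that the fallback has to reproduce is what `pd.concat(..., axis=1).reset_index()` returns for non-default indexes. The short example below illustrates that baseline with frames mirroring the generated tests; the variable names are only for illustration.

```python
import pandas as pd

# Custom string index: reset_index() moves the labels into an "index" column.
custom = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
out_custom = pd.concat([custom], axis=1).reset_index()

# Named MultiIndex: reset_index() creates one column per level, named after the level.
levels = pd.MultiIndex.from_arrays(
    [["bar", "baz"], ["one", "two"]], names=("first", "second")
)
multi = pd.DataFrame({"a": [1, 2]}, index=levels)
out_multi = pd.concat([multi], axis=1).reset_index()

print(list(out_custom.columns))      # ['index', 'a']
print(out_custom["index"].tolist())  # ['x', 'y']
print(list(out_multi.columns))       # ['first', 'second', 'a']
```

In both cases the new columns hold the original labels rather than `0..n-1`, which is why these inputs cannot take the range-insertion shortcut.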

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 30 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import pandas as pd

# imports
import pytest  # used for our unit tests

from unstructured.metrics.utils import _format_grouping_output

# unit tests

# ---------------- BASIC TEST CASES ----------------


def test_single_dataframe():
    # Test with a single DataFrame input
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 147μs -> 128μs (14.4% faster)
    # Should add a new index column at the start
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_two_dataframes_same_length():
    # Test with two DataFrames of the same length
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 179μs -> 137μs (30.6% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_three_dataframes():
    # Test with three DataFrames
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    df3 = pd.DataFrame({"c": [5, 6]})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 186μs -> 142μs (31.0% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4], "c": [5, 6]})


def test_column_name_conflict():
    # Test with two DataFrames having the same column name
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"a": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 174μs -> 136μs (28.0% faster)
    # pandas keeps both columns under the same name "a"; a dict literal cannot
    # hold duplicate keys, so the expected frame is built from rows instead
    expected = pd.DataFrame([[0, 1, 3], [1, 2, 4]], columns=["index", "a", "a"])


def test_empty_dataframe():
    # Test with an empty DataFrame
    df1 = pd.DataFrame({"a": []})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 127μs -> 111μs (14.7% faster)
    expected = pd.DataFrame({"index": [], "a": []})


def test_dataframe_with_index():
    # Test DataFrame with a custom index
    df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 134μs -> 137μs (2.28% slower)
    # reset_index() moves the custom labels into the "index" column
    expected = pd.DataFrame({"index": ["x", "y"], "a": [1, 2]})


# ---------------- EDGE TEST CASES ----------------


def test_no_arguments():
    # Test when no DataFrames are passed (should raise ValueError)
    with pytest.raises(ValueError):
        _format_grouping_output()  # 3.71μs -> 3.50μs (5.94% faster)


def test_non_dataframe_argument():
    # Test with a non-DataFrame argument (should raise TypeError)
    df1 = pd.DataFrame({"a": [1, 2]})
    not_a_df = [1, 2]
    with pytest.raises(TypeError):
        _format_grouping_output(df1, not_a_df)  # 6.71μs -> 8.08μs (17.0% slower)


def test_dataframe_with_multiindex():
    # Test with DataFrame with MultiIndex
    arrays = [["bar", "baz"], ["one", "two"]]
    index = pd.MultiIndex.from_arrays(arrays, names=("first", "second"))
    df1 = pd.DataFrame({"a": [1, 2]}, index=index)
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 269μs -> 266μs (1.11% faster)
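    # reset_index() turns each named MultiIndex level into its own column
    expected = pd.DataFrame({"first": ["bar", "baz"], "second": ["one", "two"], "a": [1, 2]})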


def test_dataframe_with_nan_values():
    # Test with DataFrame containing NaN values
    df1 = pd.DataFrame({"a": [1, None]})
    df2 = pd.DataFrame({"b": [None, 2]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 177μs -> 135μs (31.5% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, None], "b": [None, 2]})


def test_dataframe_with_object_columns():
    # Test with object dtype columns (strings)
    df1 = pd.DataFrame({"a": ["foo", "bar"]})
    df2 = pd.DataFrame({"b": ["baz", "qux"]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 166μs -> 129μs (28.3% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": ["foo", "bar"], "b": ["baz", "qux"]})


def test_dataframe_with_all_none():
    # Test with DataFrame where all values are None
    df1 = pd.DataFrame({"a": [None, None]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 127μs -> 110μs (14.8% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [None, None]})


# ---------------- LARGE SCALE TEST CASES ----------------


def test_large_number_of_rows():
    # Test with a large number of rows (1000)
    df1 = pd.DataFrame({"a": list(range(1000))})
    df2 = pd.DataFrame({"b": list(range(1000, 2000))})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 169μs -> 129μs (31.3% faster)


def test_large_number_of_columns():
    # Test with a large number of DataFrames (columns)
    dfs = [pd.DataFrame({f"col{i}": [i, i + 1]}) for i in range(100)]
    codeflash_output = _format_grouping_output(*dfs)
    result = codeflash_output  # 1.38ms -> 1.10ms (26.1% faster)


def test_large_mixed_types():
    # Test with large DataFrames with mixed types
    df1 = pd.DataFrame({"a": list(range(1000))})
    df2 = pd.DataFrame({"b": ["x"] * 1000})
    df3 = pd.DataFrame({"c": [None] * 1000})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 230μs -> 156μs (47.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pandas as pd

# imports
import pytest  # used for our unit tests

from unstructured.metrics.utils import _format_grouping_output

# unit tests

# 1. Basic Test Cases


def test_single_dataframe():
    # Test with a single DataFrame
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 135μs -> 115μs (16.8% faster)
    # Should add an 'index' column and keep all data
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_two_dataframes_same_length():
    # Test with two DataFrames of same length
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 168μs -> 129μs (30.4% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4]})


def test_three_dataframes():
    # Test with three DataFrames
    df1 = pd.DataFrame({"a": [1, 2]})
    df2 = pd.DataFrame({"b": [3, 4]})
    df3 = pd.DataFrame({"c": [5, 6]})
    codeflash_output = _format_grouping_output(df1, df2, df3)
    result = codeflash_output  # 180μs -> 136μs (31.7% faster)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2], "b": [3, 4], "c": [5, 6]})


def test_column_name_collision():
    # Test with DataFrames that have overlapping column names
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 174μs -> 135μs (29.1% faster)
    # pd.concat does not suffix overlapping names; both "a" columns are kept.
    # A dict literal cannot hold duplicate keys, so build the frame from rows.
    expected = pd.DataFrame(
        [[0, 1, 3, 5, 7], [1, 2, 4, 6, 8]], columns=["index", "a", "b", "a", "c"]
    )


def test_empty_dataframes():
    # Test with empty DataFrames (0 rows, but with columns)
    df1 = pd.DataFrame({"a": []})
    df2 = pd.DataFrame({"b": []})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 163μs -> 126μs (29.7% faster)
    expected = pd.DataFrame({"index": [], "a": [], "b": []})


def test_dataframe_with_index():
    # DataFrame with a custom index
    df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 134μs -> 135μs (0.861% slower)
    expected = pd.DataFrame({"index": ["x", "y"], "a": [1, 2]})


# 2. Edge Test Cases


def test_no_arguments():
    # Should raise ValueError if no DataFrames are passed
    with pytest.raises(ValueError):
        _format_grouping_output()  # 3.58μs -> 3.38μs (6.19% faster)


def test_non_dataframe_argument():
    # Passing a non-DataFrame should raise an error
    df1 = pd.DataFrame({"a": [1, 2]})
    not_a_df = [1, 2]
    with pytest.raises(TypeError):
        _format_grouping_output(df1, not_a_df)  # 6.83μs -> 8.08μs (15.5% slower)


def test_dataframe_with_no_columns():
    # DataFrame with no columns but with rows
    df1 = pd.DataFrame(index=[0, 1])
    df2 = pd.DataFrame({"a": [1, 2]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 190μs -> 190μs (0.175% slower)
    expected = pd.DataFrame({"index": [0, 1], "a": [1, 2]})


def test_dataframe_with_multiindex():
    # DataFrame with MultiIndex
    idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["x", "y"])
    df1 = pd.DataFrame({"val": [10, 20]}, index=idx)
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 258μs -> 258μs (0.113% faster)
    expected = df1.reset_index()


def test_dataframe_with_duplicate_columns():
    # DataFrame with duplicate column names
    df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])
    codeflash_output = _format_grouping_output(df1)
    result = codeflash_output  # 136μs -> 118μs (14.6% faster)
    expected = df1.reset_index()


def test_dataframe_with_nan_values():
    # DataFrame with NaN values
    df1 = pd.DataFrame({"a": [1, None, 3]})
    df2 = pd.DataFrame({"b": [4, 5, None]})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 170μs -> 129μs (31.6% faster)
    expected = pd.DataFrame({"index": [0, 1, 2], "a": [1, None, 3], "b": [4, 5, None]})


# 3. Large Scale Test Cases


def test_large_number_of_rows():
    # Test with a large DataFrame (1000 rows)
    n = 1000
    df1 = pd.DataFrame({"a": range(n)})
    df2 = pd.DataFrame({"b": range(n, 2 * n)})
    codeflash_output = _format_grouping_output(df1, df2)
    result = codeflash_output  # 169μs -> 128μs (31.9% faster)
    expected = pd.DataFrame({"index": range(n), "a": range(n), "b": range(n, 2 * n)})


def test_many_dataframes():
    # Test with many DataFrames (10 DataFrames of 100 rows)
    n = 100
    dfs = [pd.DataFrame({f"col{i}": range(n)}) for i in range(10)]
    codeflash_output = _format_grouping_output(*dfs)
    result = codeflash_output  # 272μs -> 211μs (29.0% faster)
    # Build expected DataFrame
    expected_dict = {"index": range(n)}
    for i in range(10):
        expected_dict[f"col{i}"] = range(n)
    expected = pd.DataFrame(expected_dict)


def test_large_number_of_columns():
    # Test with a DataFrame with many columns (500 columns)
    n = 10
    cols = {f"col{i}": range(n) for i in range(500)}
    df = pd.DataFrame(cols)
    codeflash_output = _format_grouping_output(df)
    result = codeflash_output  # 159μs -> 137μs (16.8% faster)
    expected = df.reset_index()

import pytest

from unstructured.metrics.utils import _format_grouping_output


def test__format_grouping_output():
    with pytest.raises(ValueError, match="No\\ objects\\ to\\ concatenate"):
        _format_grouping_output()

🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| `codeflash_concolic_e8goshnj/tmprnu9hz9r/test_concolic_coverage.py::test__format_grouping_output` | 3.92μs | 3.92μs | 0.000% ✅ |

To edit these changes, run `git checkout codeflash/optimize-_format_grouping_output-mjck9hzq` and push.

codeflash-ai bot requested a review from aseembits93 December 19, 2025 07:41

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Dec 19, 2025
