@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 99% (0.99x) speedup for ObjectDetectionEvalProcessor._box_iou in unstructured/metrics/object_detection.py

⏱️ Runtime : 3.20 milliseconds → 1.61 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization replaces PyTorch's vectorized IoU calculation with a Numba-compiled implementation that provides significant speedup for this specific workload.

Key Changes:

  • Numba JIT compilation: Added @njit(fastmath=True, cache=True) decorator to create a compiled numpy-based IoU function that eliminates Python overhead
  • Explicit nested loops: Replaced PyTorch's broadcasting operations with explicit loops that are highly optimized by Numba's compiler
  • Data conversion: Converts PyTorch tensors to numpy arrays, processes with Numba, then converts back
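The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `box_iou_loops` is hypothetical, the handling of degenerate boxes is an assumption, and NumPy arrays stand in for the converted tensors (in PyTorch the usual conversions are `tensor.numpy()` and `torch.from_numpy(array)`). The PR reports `@njit(fastmath=True, cache=True)`; `cache=True` is left out here so the sketch also runs from an interactive session, and a no-op fallback is used when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # the PR reports using @njit(fastmath=True, cache=True)
except ImportError:
    def njit(*args, **kwargs):
        # Fallback (assumption for this sketch): run as plain Python without Numba.
        def wrap(func):
            return func
        return wrap


@njit(fastmath=True)
def box_iou_loops(box1, box2):
    """Pairwise IoU of (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) format."""
    n = box1.shape[0]
    m = box2.shape[0]
    out = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        area1 = (box1[i, 2] - box1[i, 0]) * (box1[i, 3] - box1[i, 1])
        for j in range(m):
            area2 = (box2[j, 2] - box2[j, 0]) * (box2[j, 3] - box2[j, 1])
            # Width and height of the intersection rectangle (may be <= 0).
            w = min(box1[i, 2], box2[j, 2]) - max(box1[i, 0], box2[j, 0])
            h = min(box1[i, 3], box2[j, 3]) - max(box1[i, 1], box2[j, 1])
            if w > 0.0 and h > 0.0:
                inter = w * h
                union = area1 + area2 - inter
                if union > 0.0:
                    out[i, j] = inter / union
    return out


a = np.array([[0, 0, 10, 10], [5, 5, 15, 15]], dtype=np.float32)
b = np.array([[0, 0, 10, 10], [10, 10, 20, 20]], dtype=np.float32)
iou = box_iou_loops(a, b)  # shape (2, 2), one IoU per (box1, box2) pair
```

Writing results straight into a preallocated `(N, M)` array is what lets the loop version skip the intermediate arrays that a broadcast formulation materializes.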

Why It's Faster:

  1. Eliminated broadcasting overhead: The original code used PyTorch's [:, None, 2:] broadcasting which creates large intermediate tensors. The optimized version uses direct indexing in compiled loops
  2. Reduced memory allocations: Numba's compiled code avoids creating multiple intermediate tensors for min/max/clamp operations
  3. JIT compilation benefits: The fastmath=True flag enables aggressive floating-point optimizations, while cache=True ensures compilation happens only once
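To make the broadcasting point concrete, here is a sketch of the kind of vectorized formulation the description refers to, written with NumPy so it runs without PyTorch. The function name `box_iou_broadcast` is hypothetical, and the exact body of the original `_box_iou` is an assumption; only the `[:, None, 2:]` indexing and the min/max/clamp steps are taken from the text above.

```python
import numpy as np


def box_iou_broadcast(box1, box2):
    """Pairwise IoU via broadcasting; boxes are (x1, y1, x2, y2)."""
    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # shape (N,)
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # shape (M,)
    # box1[:, None, 2:] has shape (N, 1, 2); against (M, 2) it broadcasts to (N, M, 2).
    bottom_right = np.minimum(box1[:, None, 2:], box2[:, 2:])  # intermediate (N, M, 2)
    top_left = np.maximum(box1[:, None, :2], box2[:, :2])      # intermediate (N, M, 2)
    wh = np.clip(bottom_right - top_left, 0, None)             # intermediate (N, M, 2)
    inter = wh.prod(axis=2)                                    # shape (N, M)
    union = area1[:, None] + area2 - inter                     # shape (N, M)
    return inter / union


a = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [2, 2, 8, 8]], dtype=np.float64)
b = np.array([[0, 0, 10, 10], [10, 10, 20, 20]], dtype=np.float64)
iou = box_iou_broadcast(a, b)
intermediate_shape = np.minimum(a[:, None, 2:], b[:, 2:]).shape
```

Each of the three `(N, M, 2)` arrays is allocated before the final `(N, M)` result exists, which is the allocation cost the explanation attributes to the original version.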

Performance Profile:

  • Small batches (1-4 boxes): roughly 5-7x speedup (about 390-640% faster in the timings below) - conversion overhead is minimal compared to computation savings
  • Medium batches (100x100 boxes): about 4x speedup (292-314% faster in the timings below)
  • Large batches (500x500): about 1.5x speedup (51-52% faster on structured boxes, 16.5% on unconstrained random boxes) - still beneficial but diminishing returns due to conversion costs
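The report mixes two ways of expressing the same measurement: an "Nx speedup" ratio and a "% faster" figure. As a small worked example, assuming "% faster" is computed as (old / new - 1) × 100 (which matches the numbers quoted in this PR), the two helper functions below (hypothetical names) convert between them:

```python
def speedup_ratio(old_time, new_time):
    """How many times faster the new runtime is (old / new)."""
    return old_time / new_time


def percent_faster(old_time, new_time):
    """Speedup expressed as a percentage above the baseline."""
    return (old_time / new_time - 1.0) * 100.0


# Timings quoted in this PR (units cancel, so microseconds or milliseconds both work).
small = percent_faster(32.6, 5.38)    # single-box test
large = percent_faster(627.0, 414.0)  # 500x500 test
overall = percent_faster(3.20, 1.61)  # headline runtime in milliseconds
```

Under this reading, the headline "99% (0.99x)" corresponds to a runtime ratio of just under 2x, and "506% faster" corresponds to a ratio of about 6x.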

The optimization is particularly effective for object detection evaluation pipelines where IoU calculations are performed repeatedly on moderately-sized batches of bounding boxes, which is the typical use case for this metrics module.

Correctness verification report:

Test | Status
⚙️ Existing Unit Tests | 🔘 None Found
🌀 Generated Regression Tests | 39 Passed
⏪ Replay Tests | 🔘 None Found
🔎 Concolic Coverage Tests | 🔘 None Found
📊 Tests Coverage | 100.0%
🌀 Generated Regression Tests and Runtime
import torch

from unstructured.metrics.object_detection import ObjectDetectionEvalProcessor

# function to test
# (ObjectDetectionEvalProcessor, including _box_iou, is imported above.)

# unit tests


class TestBoxIoU:
    # 1. Basic Test Cases

    def test_single_identical_box(self):
        # Two identical boxes, IoU should be 1
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 32.6μs -> 5.38μs (506% faster)

    def test_single_non_overlapping_boxes(self):
        # Two non-overlapping boxes, IoU should be 0
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[20, 20, 30, 30]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.7μs -> 5.17μs (514% faster)

    def test_partial_overlap(self):
        # Boxes partially overlap
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[5, 5, 15, 15]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.3μs -> 5.12μs (511% faster)
        # The overlap area is 5x5=25, total area is 100+100-25=175
        expected_iou = 25 / 175

    def test_multiple_boxes(self):
        # Multiple boxes in box1 and box2
        box1 = torch.tensor([[0, 0, 10, 10], [5, 5, 15, 15]])
        box2 = torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 33.1μs -> 5.29μs (526% faster)
        # iou[0,0]=1, iou[0,1]=0, iou[1,0]=25/175, iou[1,1]=25/175
        expected = torch.tensor([[1.0, 0.0], [25 / 175, 25 / 175]])

    def test_zero_area_box(self):
        # A box with zero area (x1==x2 or y1==y2)
        box1 = torch.tensor([[0, 0, 0, 10]])  # zero width
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.5μs -> 5.21μs (504% faster)

    # 2. Edge Test Cases

    def test_empty_boxes(self):
        # One or both inputs are empty
        box1 = torch.empty((0, 4))
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 33.5μs -> 5.71μs (488% faster)

        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.empty((0, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 30.3μs -> 4.42μs (587% faster)

        box1 = torch.empty((0, 4))
        box2 = torch.empty((0, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 24.9μs -> 4.08μs (510% faster)

    def test_negative_coordinates(self):
        # Boxes with negative coordinates
        box1 = torch.tensor([[-10, -10, 0, 0]])
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.5μs -> 5.21μs (505% faster)

        # Overlapping negative and positive
        box1 = torch.tensor([[-5, -5, 5, 5]])
        box2 = torch.tensor([[0, 0, 10, 10]])
        # Overlap area: (5-0)*(5-0)=25, area1=100, area2=100, union=100+100-25=175
        expected_iou = 25 / 175
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 28.7μs -> 3.92μs (633% faster)

    def test_float_coordinates(self):
        # Boxes with float coordinates
        box1 = torch.tensor([[0.0, 0.0, 10.5, 10.5]])
        box2 = torch.tensor([[5.25, 5.25, 15.75, 15.75]])
        # Overlap area: (10.5-5.25)*(10.5-5.25)=5.25*5.25=27.5625
        # area1=10.5*10.5=110.25, area2=10.5*10.5=110.25, union=110.25+110.25-27.5625=192.9375
        expected_iou = 27.5625 / 192.9375
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 29.9μs -> 5.21μs (474% faster)

    def test_touching_boxes(self):
        # Boxes that just touch at the edge, no overlap
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[10, 0, 20, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.2μs -> 5.12μs (510% faster)

        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[0, 10, 10, 20]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 28.5μs -> 3.88μs (634% faster)

    def test_invalid_box_coordinates(self):
        # x2 < x1 or y2 < y1 (invalid box)
        box1 = torch.tensor([[10, 10, 0, 0]])  # x2 < x1, y2 < y1
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 30.8μs -> 5.08μs (507% faster)

    def test_large_coordinates(self):
        # Very large coordinates
        box1 = torch.tensor([[0, 0, 1_000_000, 1_000_000]])
        box2 = torch.tensor([[500_000, 500_000, 1_500_000, 1_500_000]])
        # Overlap area: (1_000_000-500_000)^2 = 500_000^2 = 250_000_000_000
        # area1 = 1_000_000^2 = 1_000_000_000_000
        # area2 = 1_000_000^2 = 1_000_000_000_000
        # union = 1_000_000_000_000 + 1_000_000_000_000 - 250_000_000_000 = 1_750_000_000_000
        expected_iou = 250_000_000_000 / 1_750_000_000_000
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 30.9μs -> 5.04μs (512% faster)

    # 3. Large Scale Test Cases

    def test_large_number_of_boxes(self):
        # Test with 500 boxes in box1 and 500 in box2
        N, M = 500, 500
        # Boxes: each box is [i, i, i+10, i+10]
        box1 = torch.stack(
            [torch.tensor([i, i, i + 10, i + 10], dtype=torch.float32) for i in range(N)]
        )
        box2 = torch.stack(
            [torch.tensor([i, i, i + 10, i + 10], dtype=torch.float32) for i in range(M)]
        )
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 627μs -> 414μs (51.4% faster)

    def test_large_batch_with_partial_overlap(self):
        # 100 boxes compared against a single central box; only some of them overlap it
        N = 100
        box1 = torch.stack(
            [torch.tensor([i, i, i + 10, i + 10], dtype=torch.float32) for i in range(N)]
        )
        # Central box overlaps only the boxes with i < 15
        box2 = torch.tensor([[5, 5, 15, 15]], dtype=torch.float32)
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 35.8μs -> 5.54μs (546% faster)
        # Only boxes that overlap with [5,5,15,15] should have nonzero IoU
        for idx in range(N):
            # Overlap if [i, i, i+10, i+10] intersects [5,5,15,15]
            x1 = max(idx, 5)
            y1 = max(idx, 5)
            x2 = min(idx + 10, 15)
            y2 = min(idx + 10, 15)
            inter_w = max(0, x2 - x1)
            inter_h = max(0, y2 - y1)
            inter = inter_w * inter_h
            area1 = 10 * 10
            area2 = 10 * 10
            union = area1 + area2 - inter
            expected_iou = inter / union if union > 0 else 0

    def test_large_empty_boxes(self):
        # Large empty input
        box1 = torch.empty((0, 4))
        box2 = torch.empty((500, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 30.3μs -> 6.04μs (402% faster)

        box1 = torch.empty((500, 4))
        box2 = torch.empty((0, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 26.4μs -> 4.29μs (515% faster)

    def test_maximum_tensor_size(self):
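        # Nominally a size-limit test (100MB limit): each float32 is 4 bytes,
        # so 100MB/4 = 25,000,000 elements would be the ceiling for N*M.
        # N=M=500 gives only 250,000 IoU values (~1MB), so this stays far below that limit.
        # Note: torch.rand does not guarantee x2>x1 or y2>y1, so some boxes are invalid.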
        N = 500
        box1 = torch.rand((N, 4))
        box2 = torch.rand((N, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 627μs -> 538μs (16.5% faster)

    # 4. Determinism and Consistency

    def test_determinism(self):
        # IoU should be deterministic for same input
        box1 = torch.tensor([[0, 0, 10, 10], [5, 5, 15, 15]], dtype=torch.float32)
        box2 = torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]], dtype=torch.float32)
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou1 = codeflash_output  # 35.2μs -> 5.38μs (555% faster)
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou2 = codeflash_output  # 27.9μs -> 3.75μs (643% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import torch

from unstructured.metrics.object_detection import ObjectDetectionEvalProcessor

# function to test
# (ObjectDetectionEvalProcessor._box_iou is imported above.)

# ---------------------------
# Unit tests for _box_iou
# ---------------------------


class TestBoxIoU:
    # ---------------------------
    # 1. Basic Test Cases
    # ---------------------------

    def test_identical_boxes(self):
        # Test that identical boxes have IoU of 1
        box = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box, box)
        iou = codeflash_output  # 34.7μs -> 5.54μs (526% faster)

    def test_non_overlapping_boxes(self):
        # Test that non-overlapping boxes have IoU of 0
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[20, 20, 30, 30]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.8μs -> 5.21μs (511% faster)

    def test_partial_overlap(self):
        # Test that partially overlapping boxes have correct IoU
        # box1: (0,0,10,10), box2: (5,5,15,15)
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[5, 5, 15, 15]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.5μs -> 5.12μs (514% faster)
        # Intersection area: (5,5)-(10,10) = 5x5=25
        # Area1 = 100, Area2 = 100, union = 100+100-25=175, IoU = 25/175
        expected = 25.0 / 175.0

    def test_multiple_boxes(self):
        # Test multiple boxes in both inputs
        box1 = torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
        box2 = torch.tensor([[0, 0, 10, 10], [5, 5, 15, 15]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 33.2μs -> 5.17μs (542% faster)
        # iou[0,0] = 1, iou[0,1] = 25/175, iou[1,0] = 0, iou[1,1] = 25/175
        expected = torch.tensor([[1.0, 25.0 / 175.0], [0.0, 25.0 / 175.0]])

    def test_different_sizes(self):
        # Test boxes of different sizes
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[0, 0, 20, 20]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.3μs -> 5.12μs (511% faster)
        # Intersection area: 10x10=100, Area1=100, Area2=400, union=400
        expected = 100 / 400

    # ---------------------------
    # 2. Edge Test Cases
    # ---------------------------

    def test_zero_area_box(self):
        # Test boxes with zero area (x1==x2 or y1==y2)
        box1 = torch.tensor([[0, 0, 0, 10]])  # zero width
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.2μs -> 5.04μs (518% faster)

    def test_touching_boxes(self):
        # Test boxes that touch at the edge but do not overlap
        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.tensor([[10, 0, 20, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.2μs -> 5.12μs (508% faster)

    def test_negative_coordinates(self):
        # Test boxes with negative coordinates
        box1 = torch.tensor([[-10, -10, 0, 0]])
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.2μs -> 5.08μs (514% faster)

    def test_fully_contained(self):
        # Test box1 fully contained in box2
        box1 = torch.tensor([[2, 2, 8, 8]])
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 31.1μs -> 5.08μs (512% faster)
        # Intersection = 36, Area1=36, Area2=100, union=100
        expected = 36 / 100

    def test_empty_input(self):
        # Test empty input tensors
        box1 = torch.empty((0, 4))
        box2 = torch.empty((0, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 28.5μs -> 5.83μs (389% faster)

    def test_one_empty_input(self):
        # Test one input empty, one non-empty
        box1 = torch.empty((0, 4))
        box2 = torch.tensor([[0, 0, 10, 10]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 33.1μs -> 5.71μs (480% faster)

        box1 = torch.tensor([[0, 0, 10, 10]])
        box2 = torch.empty((0, 4))
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 30.2μs -> 4.50μs (572% faster)

    def test_float_precision(self):
        # Test float coordinates and precision
        box1 = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
        box2 = torch.tensor([[5.0, 5.0, 15.0, 15.0]])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 29.9μs -> 5.21μs (474% faster)
        expected = 25.0 / 175.0

    # ---------------------------
    # 3. Large Scale Test Cases
    # ---------------------------

    def test_large_batch(self):
        # Test with large number of boxes (up to 500x500 = 250,000 IoUs, <100MB)
        N, M = 500, 500
        # Boxes: random, but ensure they are valid (x2>x1, y2>y1)
        torch.manual_seed(42)
        box1 = torch.rand(N, 4) * 1000
        box2 = torch.rand(M, 4) * 1000
        # Ensure x2>x1 and y2>y1 for both
        box1[:, 2] = box1[:, 0] + torch.abs(box1[:, 2] - box1[:, 0]) + 1
        box1[:, 3] = box1[:, 1] + torch.abs(box1[:, 3] - box1[:, 1]) + 1
        box2[:, 2] = box2[:, 0] + torch.abs(box2[:, 2] - box2[:, 0]) + 1
        box2[:, 3] = box2[:, 1] + torch.abs(box2[:, 3] - box2[:, 1]) + 1

        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 631μs -> 414μs (52.4% faster)

    def test_large_batch_identical(self):
        # Test large batch where all boxes are identical (should all be IoU=1)
        N = 100
        box = torch.tensor([[0, 0, 10, 10]]).repeat(N, 1)
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box, box)
        iou = codeflash_output  # 100μs -> 24.3μs (314% faster)

    def test_large_batch_non_overlapping(self):
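        # Test large batch where same-index boxes do not overlap (IoU=0 on the diagonal).
        # Note: box2 is shifted by 1000, so box1[i + 50] coincides with box2[i] for i < 50;
        # those pairs have IoU=1, and the full matrix is not all zeros.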
        N = 100
        box1 = torch.stack([torch.tensor([i * 20, 0, i * 20 + 10, 10]) for i in range(N)])
        box2 = torch.stack([torch.tensor([i * 20 + 1000, 0, i * 20 + 1010, 10]) for i in range(N)])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 95.2μs -> 24.1μs (295% faster)

    def test_large_batch_partial_overlap(self):
        # Test large batch with partial overlaps (each box overlaps with its pair and with its left neighbor)
        N = 100
        box1 = torch.stack([torch.tensor([i * 10, 0, i * 10 + 10, 10]) for i in range(N)])
        box2 = torch.stack([torch.tensor([i * 10 + 5, 0, i * 10 + 15, 10]) for i in range(N)])
        codeflash_output = ObjectDetectionEvalProcessor._box_iou(box1, box2)
        iou = codeflash_output  # 94.0μs -> 24.0μs (292% faster)
        # Each overlapping pair: overlap area = 5*10=50, union = 10*10+10*10-50=150, IoU=50/150=1/3
        # box1[i] overlaps box2[i] (right half) and box2[i - 1] (left half)
        expected = torch.full((N, N), 0.0)
        for i in range(N):
            expected[i, i] = 1 / 3
            if i > 0:
                expected[i, i - 1] = 1 / 3


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
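
The expected values written in the test comments above can be checked by hand with a scalar IoU helper. The sketch below is independent of the PR's code; `iou_pair` is a hypothetical name, and treating a non-positive union as IoU 0 is an assumption made for this illustration only.

```python
def iou_pair(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Clamp at 0 so touching or disjoint boxes contribute no intersection.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


partial = iou_pair((0, 0, 10, 10), (5, 5, 15, 15))   # 25 / 175
nested = iou_pair((2, 2, 8, 8), (0, 0, 10, 10))      # 36 / 100
bigger = iou_pair((0, 0, 10, 10), (0, 0, 20, 20))    # 100 / 400
touching = iou_pair((0, 0, 10, 10), (10, 0, 20, 10)) # shared edge only
shifted = iou_pair((0, 0, 10, 10), (5, 0, 15, 10))   # 50 / 150
```

A helper like this is handy for spot-checking individual entries of the matrix returned by `_box_iou`, although comparisons against float32 tensors would need a tolerance.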

To edit these changes, run git checkout codeflash/optimize-ObjectDetectionEvalProcessor._box_iou-mjcemef3 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 05:03
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash) Dec 19, 2025