@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 17% (0.17x) speedup for SD3DenoiseInvocation._prepare_cfg_scale in invokeai/app/invocations/sd3_denoise.py

⏱️ Runtime : 41.8 microseconds → 35.6 microseconds (best of 174 runs)
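
The headline percentage can be checked from the two runtimes. The sketch below assumes speedup is defined as original runtime divided by optimized runtime, minus one; the report does not state its formula, but this definition reproduces the quoted figures. The helper name `speedup` is hypothetical.

```python
def speedup(original_us: float, optimized_us: float) -> float:
    """Relative speedup, assuming the definition original / optimized - 1."""
    return original_us / optimized_us - 1.0


# Runtimes quoted in the report, in microseconds.
ratio = speedup(41.8, 35.6)
print(f"{ratio:.2f}x ({ratio:.0%})")  # prints "0.17x (17%)"
```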

📝 Explanation and details

The optimization achieves a 17% speedup by eliminating redundant attribute lookups and restructuring the control flow for better efficiency.

Key optimizations applied:

  1. Single attribute lookup: The original code accessed self.cfg_scale multiple times (up to 3 times in the worst case). The optimized version stores it once in a local variable, cfg_scale = self.cfg_scale, eliminating the repeated attribute access overhead.

  2. Early returns: Instead of using elif and a final return cfg_scale statement, the optimized code returns from each branch directly (return [cfg_scale] * num_timesteps and return cfg_scale), shortening the execution path.

  3. Removed variable assignment: The original code assigned to a cfg_scale variable in both branches before returning. The optimized version returns directly, eliminating the intermediate assignments.
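
The three changes listed above can be sketched as a before/after pair. This is a reconstruction from the description and the generated tests, not the code in invokeai/app/invocations/sd3_denoise.py: the classes `OriginalSketch` and `OptimizedSketch` are hypothetical plain-Python stand-ins for `SD3DenoiseInvocation`, and the error message text is an assumption.

```python
from typing import List, Union

CfgScale = Union[float, List[float]]


class OriginalSketch:
    """Hypothetical stand-in for the method before the optimization."""

    def __init__(self, cfg_scale: CfgScale) -> None:
        self.cfg_scale = cfg_scale

    def _prepare_cfg_scale(self, num_timesteps: int) -> List[float]:
        # self.cfg_scale is read again in every check and every branch.
        if isinstance(self.cfg_scale, float):
            cfg_scale = [self.cfg_scale] * num_timesteps
        elif isinstance(self.cfg_scale, list):
            assert len(self.cfg_scale) == num_timesteps
            cfg_scale = self.cfg_scale
        else:
            raise ValueError(f"Invalid CFG scale type: {type(self.cfg_scale)}")
        return cfg_scale


class OptimizedSketch:
    """Hypothetical stand-in for the method after the optimization."""

    def __init__(self, cfg_scale: CfgScale) -> None:
        self.cfg_scale = cfg_scale

    def _prepare_cfg_scale(self, num_timesteps: int) -> List[float]:
        cfg_scale = self.cfg_scale  # single attribute lookup
        if isinstance(cfg_scale, float):
            return [cfg_scale] * num_timesteps  # early return, no temp assignment
        if isinstance(cfg_scale, list):
            assert len(cfg_scale) == num_timesteps
            return cfg_scale
        raise ValueError(f"Invalid CFG scale type: {type(cfg_scale)}")
```

Both versions accept the same inputs and raise the same exception types; only the number of attribute reads and the shape of the control flow differ.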

Why this leads to speedup:

  • Attribute access cost: In Python, reading self.cfg_scale goes through an attribute lookup (typically one or more dictionary lookups), which is more expensive than reading a local variable
  • Reduced branching: Early returns eliminate the need for the final return cfg_scale statement and reduce code paths
  • Fewer operations: Eliminates intermediate variable assignments that don't add value
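
The attribute-versus-local cost can be observed with a small micro-benchmark. This is a sketch under assumptions: `Holder` is a hypothetical plain class rather than the real invocation model, and the absolute timings depend on the machine and Python version, so no expected numbers are shown.

```python
import timeit


class Holder:
    """Hypothetical plain object with one instance attribute."""

    def __init__(self, value: float) -> None:
        self.value = value


def repeated_lookups(obj: Holder) -> tuple:
    # Three separate attribute reads on the same object.
    return (obj.value, obj.value, obj.value)


def single_lookup(obj: Holder) -> tuple:
    # One attribute read, then two cheap local-variable reads.
    value = obj.value
    return (value, value, value)


if __name__ == "__main__":
    holder = Holder(3.5)
    for fn in (repeated_lookups, single_lookup):
        # Take the minimum of several repeats to reduce scheduling noise.
        seconds = min(timeit.repeat(lambda: fn(holder), number=200_000, repeat=5))
        print(f"{fn.__name__}: {seconds:.4f} s per 200,000 calls")
```

Differences of this kind are measured in nanoseconds per call, which matches the scale of the per-test timings reported in this PR.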

Performance impact by test cases:
The optimization shows consistent improvements across all scenarios:

  • List inputs: 16-36% faster (best case), as they benefit most from avoiding redundant attribute lookups
  • Float inputs: 8-24% faster, with larger improvements for edge cases like zero/negative timesteps
  • Error cases: 5-19% faster, even when raising exceptions

This function appears to be part of the SD3 (Stable Diffusion 3) denoising pipeline, where CFG (Classifier-Free Guidance) scaling is applied at each timestep. Given that denoising typically involves hundreds of timesteps, even small per-call optimizations can compound into meaningful performance gains in image generation workflows.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 72 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 85.7% |

🌀 Generated Regression Tests and Runtime
import pytest
from invokeai.app.invocations.sd3_denoise import SD3DenoiseInvocation

# unit tests

# ===== Basic Test Cases =====

def test_cfg_scale_float_basic():
    """Test with float cfg_scale and small num_timesteps."""
    inv = SD3DenoiseInvocation(cfg_scale=3.5)
    codeflash_output = inv._prepare_cfg_scale(4); result = codeflash_output # 941ns -> 866ns (8.66% faster)

def test_cfg_scale_list_basic():
    """Test with list cfg_scale matching num_timesteps."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    codeflash_output = inv._prepare_cfg_scale(3); result = codeflash_output # 972ns -> 754ns (28.9% faster)

def test_cfg_scale_float_single_step():
    """Test with float cfg_scale and num_timesteps=1."""
    inv = SD3DenoiseInvocation(cfg_scale=7.2)
    codeflash_output = inv._prepare_cfg_scale(1); result = codeflash_output # 917ns -> 834ns (9.95% faster)

def test_cfg_scale_list_single_step():
    """Test with list cfg_scale, single element, num_timesteps=1."""
    inv = SD3DenoiseInvocation(cfg_scale=[8.1])
    codeflash_output = inv._prepare_cfg_scale(1); result = codeflash_output # 983ns -> 768ns (28.0% faster)

# ===== Edge Test Cases =====

def test_cfg_scale_list_length_mismatch_short():
    """Test with list cfg_scale shorter than num_timesteps (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(3) # 1.52μs -> 1.36μs (11.5% faster)

def test_cfg_scale_list_length_mismatch_long():
    """Test with list cfg_scale longer than num_timesteps (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0, 4.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(3) # 1.46μs -> 1.25μs (16.6% faster)

def test_cfg_scale_zero_timesteps_float():
    """Test with float cfg_scale and num_timesteps=0 (should return empty list)."""
    inv = SD3DenoiseInvocation(cfg_scale=2.5)
    codeflash_output = inv._prepare_cfg_scale(0); result = codeflash_output # 995ns -> 874ns (13.8% faster)

def test_cfg_scale_zero_timesteps_list():
    """Test with list cfg_scale and num_timesteps=0 (should raise AssertionError if list not empty)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(0) # 1.46μs -> 1.29μs (13.1% faster)

def test_cfg_scale_zero_timesteps_list_empty():
    """Test with empty list cfg_scale and num_timesteps=0 (should succeed)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = inv._prepare_cfg_scale(0); result = codeflash_output # 1.05μs -> 840ns (25.1% faster)



def test_cfg_scale_list_with_mixed_types():
    """Test with list cfg_scale containing both floats and ints."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2, 3.5])
    codeflash_output = inv._prepare_cfg_scale(3); result = codeflash_output # 1.22μs -> 894ns (36.9% faster)

def test_cfg_scale_list_empty_nonzero_timesteps():
    """Test with empty list cfg_scale and num_timesteps > 0 (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(2) # 1.54μs -> 1.44μs (7.22% faster)

def test_cfg_scale_float_negative_timesteps():
    """Test with float cfg_scale and negative num_timesteps (should return empty list)."""
    inv = SD3DenoiseInvocation(cfg_scale=1.5)
    codeflash_output = inv._prepare_cfg_scale(-2); result = codeflash_output # 1.07μs -> 865ns (24.0% faster)

def test_cfg_scale_list_negative_timesteps():
    """Test with list cfg_scale and negative num_timesteps (should raise AssertionError if list not empty)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(-2) # 1.48μs -> 1.32μs (12.0% faster)

def test_cfg_scale_list_empty_negative_timesteps():
    """Test with empty list cfg_scale and negative num_timesteps (should raise AssertionError, since len([]) != -2)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(-2)

# ===== Large Scale Test Cases =====

def test_cfg_scale_float_large_timesteps():
    """Test with float cfg_scale and large num_timesteps."""
    large_steps = 999
    value = 2.0
    inv = SD3DenoiseInvocation(cfg_scale=value)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.57μs -> 1.35μs (16.2% faster)

def test_cfg_scale_list_large_timesteps():
    """Test with large list cfg_scale and matching num_timesteps."""
    large_steps = 999
    values = [float(i) for i in range(large_steps)]
    inv = SD3DenoiseInvocation(cfg_scale=values)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.16μs -> 881ns (31.6% faster)

def test_cfg_scale_list_large_timesteps_mismatch():
    """Test with large list cfg_scale and mismatched num_timesteps (should raise AssertionError)."""
    large_steps = 999
    values = [float(i) for i in range(large_steps + 1)]
    inv = SD3DenoiseInvocation(cfg_scale=values)
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(large_steps) # 1.57μs -> 1.49μs (5.58% faster)

def test_cfg_scale_float_large_timesteps_zero_value():
    """Test with float cfg_scale=0.0 and large num_timesteps."""
    large_steps = 999
    inv = SD3DenoiseInvocation(cfg_scale=0.0)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.38μs -> 1.27μs (8.16% faster)

def test_cfg_scale_list_large_timesteps_all_same_value():
    """Test with large list cfg_scale, all elements identical."""
    large_steps = 999
    value = 4.2
    values = [value] * large_steps
    inv = SD3DenoiseInvocation(cfg_scale=values)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.06μs -> 851ns (24.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from invokeai.app.invocations.sd3_denoise import SD3DenoiseInvocation

# unit tests

# ---- Basic Test Cases ----

def test_basic_float_cfg_scale():
    # Single float, num_timesteps=5
    invocation = SD3DenoiseInvocation(cfg_scale=3.5)
    codeflash_output = invocation._prepare_cfg_scale(5); result = codeflash_output # 1.44μs -> 945ns (52.6% faster)

def test_basic_list_cfg_scale():
    # List of floats, num_timesteps matches length
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 989ns -> 851ns (16.2% faster)

def test_basic_float_cfg_scale_one_step():
    # Single float, num_timesteps=1
    invocation = SD3DenoiseInvocation(cfg_scale=7.5)
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 965ns -> 827ns (16.7% faster)

def test_basic_list_cfg_scale_one_step():
    # List with one float, num_timesteps=1
    invocation = SD3DenoiseInvocation(cfg_scale=[8.8])
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 1.01μs -> 772ns (30.6% faster)

# ---- Edge Test Cases ----

def test_edge_empty_list_cfg_scale():
    # Empty list, num_timesteps=0
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 931ns -> 774ns (20.3% faster)

def test_edge_list_length_mismatch_raises():
    # List length does not match num_timesteps, should raise AssertionError
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(3) # 1.53μs -> 1.39μs (10.1% faster)

def test_edge_invalid_type_cfg_scale_int():
    # cfg_scale is an int, should raise ValueError
    invocation = SD3DenoiseInvocation(cfg_scale=5)
    with pytest.raises(ValueError):
        invocation._prepare_cfg_scale(2)




def test_edge_cfg_scale_float_zero_timesteps():
    # Float cfg_scale, zero timesteps
    invocation = SD3DenoiseInvocation(cfg_scale=2.0)
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.13μs -> 910ns (24.1% faster)

def test_edge_cfg_scale_list_zero_timesteps():
    # List cfg_scale, zero timesteps, must be empty list
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 992ns -> 819ns (21.1% faster)

def test_edge_cfg_scale_list_empty_with_nonzero_timesteps():
    # Empty list, nonzero timesteps, should raise AssertionError
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(1) # 1.48μs -> 1.40μs (5.58% faster)

# ---- Large Scale Test Cases ----

def test_large_float_cfg_scale():
    # Float cfg_scale, large num_timesteps
    invocation = SD3DenoiseInvocation(cfg_scale=1.5)
    num_timesteps = 999
    codeflash_output = invocation._prepare_cfg_scale(num_timesteps); result = codeflash_output # 1.50μs -> 1.37μs (9.64% faster)

def test_large_list_cfg_scale():
    # List cfg_scale, large num_timesteps
    num_timesteps = 999
    cfg_scale_list = [float(i) for i in range(num_timesteps)]
    invocation = SD3DenoiseInvocation(cfg_scale=cfg_scale_list)
    codeflash_output = invocation._prepare_cfg_scale(num_timesteps); result = codeflash_output # 1.08μs -> 812ns (33.0% faster)

def test_large_list_cfg_scale_length_mismatch():
    # List cfg_scale, length mismatch with num_timesteps, should raise AssertionError
    num_timesteps = 999
    cfg_scale_list = [float(i) for i in range(num_timesteps - 1)]
    invocation = SD3DenoiseInvocation(cfg_scale=cfg_scale_list)
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(num_timesteps) # 1.62μs -> 1.36μs (19.4% faster)


def test_cfg_scale_float_negative_timesteps():
    # Negative timesteps, should return empty list (since [x]*(-n) == [])
    invocation = SD3DenoiseInvocation(cfg_scale=2.2)
    codeflash_output = invocation._prepare_cfg_scale(-5); result = codeflash_output # 1.22μs -> 989ns (22.9% faster)

def test_cfg_scale_list_negative_timesteps():
    # Negative timesteps, should raise AssertionError (len(list) != -n)
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(-2) # 1.55μs -> 1.44μs (7.28% faster)

def test_cfg_scale_float_zero():
    # cfg_scale is 0.0 float, should be repeated
    invocation = SD3DenoiseInvocation(cfg_scale=0.0)
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.02μs -> 938ns (9.06% faster)

def test_cfg_scale_list_of_zeros():
    # cfg_scale is list of zeros
    invocation = SD3DenoiseInvocation(cfg_scale=[0.0, 0.0, 0.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.00μs -> 802ns (25.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
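
The zero and negative timestep cases in the tests above rest on two plain Python facts, which can be checked without InvokeAI installed. The variable names below are illustrative only.

```python
# Repeating a list zero or a negative number of times yields an empty list,
# which is why a float cfg_scale with num_timesteps <= 0 returns [].
zero_repeat = [2.0] * 0
negative_repeat = [2.2] * -5

# len() is never negative, so a list cfg_scale can only match num_timesteps == 0
# when it is empty, and can never match a negative num_timesteps.
empty_matches_zero = len([]) == 0
empty_matches_negative = len([]) == -2

print(zero_repeat, negative_repeat, empty_matches_zero, empty_matches_negative)
# prints: [] [] True False
```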

To edit these changes, run git checkout codeflash/optimize-SD3DenoiseInvocation._prepare_cfg_scale-mhwt4z5q and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 02:26
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 13, 2025