⚡️ Speed up method `SD3DenoiseInvocation._prepare_cfg_scale` by 17% #154

codeflash-ai · 2025-11-13T02:26:14Z

📄 17% (0.17x) speedup for `SD3DenoiseInvocation._prepare_cfg_scale` in `invokeai/app/invocations/sd3_denoise.py`

⏱️ Runtime : 41.8 microseconds → 35.6 microseconds (best of 174 runs)

📝 Explanation and details

The optimization achieves a 17% speedup by eliminating redundant attribute lookups and restructuring the control flow for better efficiency.

**Key optimizations applied:**

1. **Single attribute lookup**: The original code accessed `self.cfg_scale` multiple times (up to three times in the worst case). The optimized version stores it once in a local variable, `cfg_scale = self.cfg_scale`, which removes the repeated attribute access overhead.
2. **Early returns**: Instead of using `elif` and a final `return cfg_scale` statement, the optimized code returns from each branch (`return [cfg_scale] * num_timesteps` and `return cfg_scale`), shortening the execution path.
3. **Removed variable assignment**: The original code assigned to a `cfg_scale` variable in both branches before returning. The optimized version returns directly, eliminating the intermediate assignments.
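The three changes above can be sketched side by side. This is a reconstruction from the description only, not the actual diff: `CfgHolder`, `prepare_before`, and `prepare_after` are hypothetical names, and the real `SD3DenoiseInvocation` has many more fields.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CfgHolder:
    """Hypothetical stand-in for SD3DenoiseInvocation; only cfg_scale is modeled."""

    cfg_scale: Union[float, list]

    def prepare_before(self, num_timesteps: int) -> list:
        # Assumed original shape: repeated self.cfg_scale lookups,
        # if/elif branches that assign, then one final return.
        if isinstance(self.cfg_scale, float):
            cfg_scale = [self.cfg_scale] * num_timesteps
        elif isinstance(self.cfg_scale, list):
            assert len(self.cfg_scale) == num_timesteps
            cfg_scale = self.cfg_scale
        else:
            raise ValueError(f"Invalid CFG scale type: {type(self.cfg_scale)}")
        return cfg_scale

    def prepare_after(self, num_timesteps: int) -> list:
        # One attribute lookup, then an early return from each branch.
        cfg_scale = self.cfg_scale
        if isinstance(cfg_scale, float):
            return [cfg_scale] * num_timesteps
        if isinstance(cfg_scale, list):
            assert len(cfg_scale) == num_timesteps
            return cfg_scale
        raise ValueError(f"Invalid CFG scale type: {type(cfg_scale)}")
```

Both methods are meant to return the same value or raise the same exception for any input; only the number of attribute loads and assignments differs.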

**Why this leads to a speedup:**

- **Attribute access cost**: In Python, `self.cfg_scale` involves dictionary lookups, which are more expensive than local variable access.
- **Reduced branching**: Early returns eliminate the need for the final `return cfg_scale` statement and reduce code paths.
- **Fewer operations**: Intermediate variable assignments that add no value are eliminated.
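The attribute-versus-local cost can be observed with a small micro-benchmark. This is a minimal sketch on a plain Python object (`Holder`, `repeated_lookup`, and `cached_lookup` are hypothetical names); absolute numbers depend on the interpreter and machine, and the real class may resolve attributes differently.

```python
import timeit


class Holder:
    """Hypothetical plain-Python object with one instance attribute."""

    def __init__(self, value):
        self.value = value


def repeated_lookup(obj):
    # Three separate attribute loads on the same object.
    return (obj.value, obj.value, obj.value)


def cached_lookup(obj):
    # One attribute load, then reads of a fast local variable.
    value = obj.value
    return (value, value, value)


if __name__ == "__main__":
    holder = Holder(3.5)
    for fn in (repeated_lookup, cached_lookup):
        seconds = timeit.timeit(lambda: fn(holder), number=200_000)
        print(f"{fn.__name__}: {seconds:.4f}s for 200,000 calls")
```

No expected timings are listed because they vary from run to run; the point is only that both functions return the same tuple while performing a different number of attribute loads.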

**Performance impact by test case:**

The optimization shows consistent improvements across all scenarios:

- **List inputs**: 16-36% faster (best case), as they benefit most from avoiding redundant attribute lookups.
- **Float inputs**: 8-24% faster in most runs, with larger improvements for edge cases such as zero or negative timesteps (one float run in the second generated test file reports 52.6%).
- **Error cases**: 5-19% faster, even when raising exceptions.
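The percentages in this report appear to be computed as the time saved relative to the optimized runtime, which matches the per-test annotations below. A short sketch of that arithmetic, where `speedup_percent` is a hypothetical helper and not part of Codeflash:

```python
def speedup_percent(original_ns: float, optimized_ns: float) -> float:
    """Time saved, expressed as a percentage of the optimized runtime."""
    return (original_ns - optimized_ns) / optimized_ns * 100.0


# Values taken from this report (nanoseconds).
print(f"{speedup_percent(941, 866):.2f}%")        # 8.66%  (test_cfg_scale_float_basic)
print(f"{speedup_percent(972, 754):.1f}%")        # 28.9%  (test_cfg_scale_list_basic)
print(f"{speedup_percent(41_800, 35_600):.1f}%")  # 17.4%  (overall 41.8 us -> 35.6 us)
```

Annotations given in rounded microseconds (for example `1.22μs -> 894ns`) will not reproduce exactly from the displayed values, since the underlying measurements have more digits.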

This function appears to be part of the SD3 (Stable Diffusion 3) denoising pipeline, where CFG (Classifier-Free Guidance) scaling is applied at each timestep. Because the method takes `num_timesteps` and returns the whole per-step list, it looks like it runs once per denoising invocation rather than once per timestep. The measured saving is well under a microsecond per call in the timings below, so any end-to-end gain in image generation workflows is likely to be small.
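To illustrate why a per-timestep list is useful, here is a minimal sketch of how such a list might be consumed. It uses the standard classifier-free guidance blend on plain floats; `apply_cfg` and `run_steps` are hypothetical names, and the actual InvokeAI denoising loop operates on tensors and is not shown in this PR.

```python
def apply_cfg(uncond: float, cond: float, scale: float) -> float:
    """Standard classifier-free guidance blend of two model predictions."""
    return uncond + scale * (cond - uncond)


def run_steps(cfg_scales: list, uncond_preds: list, cond_preds: list) -> list:
    # One guidance scale per timestep, which is why the list is prepared up front.
    assert len(cfg_scales) == len(uncond_preds) == len(cond_preds)
    return [apply_cfg(u, c, s) for s, u, c in zip(cfg_scales, uncond_preds, cond_preds)]


# A float cfg_scale of 3.5 expanded to three timesteps, as [3.5] * 3.
guided = run_steps([3.5] * 3, [0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
print(guided)  # [3.5, 7.0, 10.5]
```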

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 72 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 85.7% |

🌀 Generated Regression Tests and Runtime

import pytest
from invokeai.app.invocations.sd3_denoise import SD3DenoiseInvocation

# unit tests

# ===== Basic Test Cases =====

def test_cfg_scale_float_basic():
    """Test with float cfg_scale and small num_timesteps."""
    inv = SD3DenoiseInvocation(cfg_scale=3.5)
    codeflash_output = inv._prepare_cfg_scale(4); result = codeflash_output # 941ns -> 866ns (8.66% faster)

def test_cfg_scale_list_basic():
    """Test with list cfg_scale matching num_timesteps."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    codeflash_output = inv._prepare_cfg_scale(3); result = codeflash_output # 972ns -> 754ns (28.9% faster)

def test_cfg_scale_float_single_step():
    """Test with float cfg_scale and num_timesteps=1."""
    inv = SD3DenoiseInvocation(cfg_scale=7.2)
    codeflash_output = inv._prepare_cfg_scale(1); result = codeflash_output # 917ns -> 834ns (9.95% faster)

def test_cfg_scale_list_single_step():
    """Test with list cfg_scale, single element, num_timesteps=1."""
    inv = SD3DenoiseInvocation(cfg_scale=[8.1])
    codeflash_output = inv._prepare_cfg_scale(1); result = codeflash_output # 983ns -> 768ns (28.0% faster)

# ===== Edge Test Cases =====

def test_cfg_scale_list_length_mismatch_short():
    """Test with list cfg_scale shorter than num_timesteps (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(3) # 1.52μs -> 1.36μs (11.5% faster)

def test_cfg_scale_list_length_mismatch_long():
    """Test with list cfg_scale longer than num_timesteps (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0, 4.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(3) # 1.46μs -> 1.25μs (16.6% faster)

def test_cfg_scale_zero_timesteps_float():
    """Test with float cfg_scale and num_timesteps=0 (should return empty list)."""
    inv = SD3DenoiseInvocation(cfg_scale=2.5)
    codeflash_output = inv._prepare_cfg_scale(0); result = codeflash_output # 995ns -> 874ns (13.8% faster)

def test_cfg_scale_zero_timesteps_list():
    """Test with list cfg_scale and num_timesteps=0 (should raise AssertionError if list not empty)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(0) # 1.46μs -> 1.29μs (13.1% faster)

def test_cfg_scale_zero_timesteps_list_empty():
    """Test with empty list cfg_scale and num_timesteps=0 (should succeed)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = inv._prepare_cfg_scale(0); result = codeflash_output # 1.05μs -> 840ns (25.1% faster)



def test_cfg_scale_list_with_mixed_types():
    """Test with list cfg_scale containing both floats and ints."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2, 3.5])
    codeflash_output = inv._prepare_cfg_scale(3); result = codeflash_output # 1.22μs -> 894ns (36.9% faster)

def test_cfg_scale_list_empty_nonzero_timesteps():
    """Test with empty list cfg_scale and num_timesteps > 0 (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(2) # 1.54μs -> 1.44μs (7.22% faster)

def test_cfg_scale_float_negative_timesteps():
    """Test with float cfg_scale and negative num_timesteps (should return empty list)."""
    inv = SD3DenoiseInvocation(cfg_scale=1.5)
    codeflash_output = inv._prepare_cfg_scale(-2); result = codeflash_output # 1.07μs -> 865ns (24.0% faster)

def test_cfg_scale_list_negative_timesteps():
    """Test with list cfg_scale and negative num_timesteps (should raise AssertionError if list not empty)."""
    inv = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(-2) # 1.48μs -> 1.32μs (12.0% faster)

def test_cfg_scale_list_empty_negative_timesteps():
    """Test with empty list cfg_scale and negative num_timesteps (should raise AssertionError)."""
    inv = SD3DenoiseInvocation(cfg_scale=[])
    # len([]) == 0 can never equal a negative num_timesteps, so the length check fails.
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(-2)

# ===== Large Scale Test Cases =====

def test_cfg_scale_float_large_timesteps():
    """Test with float cfg_scale and large num_timesteps."""
    large_steps = 999
    value = 2.0
    inv = SD3DenoiseInvocation(cfg_scale=value)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.57μs -> 1.35μs (16.2% faster)

def test_cfg_scale_list_large_timesteps():
    """Test with large list cfg_scale and matching num_timesteps."""
    large_steps = 999
    values = [float(i) for i in range(large_steps)]
    inv = SD3DenoiseInvocation(cfg_scale=values)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.16μs -> 881ns (31.6% faster)

def test_cfg_scale_list_large_timesteps_mismatch():
    """Test with large list cfg_scale and mismatched num_timesteps (should raise AssertionError)."""
    large_steps = 999
    values = [float(i) for i in range(large_steps + 1)]
    inv = SD3DenoiseInvocation(cfg_scale=values)
    with pytest.raises(AssertionError):
        inv._prepare_cfg_scale(large_steps) # 1.57μs -> 1.49μs (5.58% faster)

def test_cfg_scale_float_large_timesteps_zero_value():
    """Test with float cfg_scale=0.0 and large num_timesteps."""
    large_steps = 999
    inv = SD3DenoiseInvocation(cfg_scale=0.0)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.38μs -> 1.27μs (8.16% faster)

def test_cfg_scale_list_large_timesteps_all_same_value():
    """Test with large list cfg_scale, all elements identical."""
    large_steps = 999
    value = 4.2
    values = [value] * large_steps
    inv = SD3DenoiseInvocation(cfg_scale=values)
    codeflash_output = inv._prepare_cfg_scale(large_steps); result = codeflash_output # 1.06μs -> 851ns (24.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest
from invokeai.app.invocations.sd3_denoise import SD3DenoiseInvocation

# unit tests

# ---- Basic Test Cases ----

def test_basic_float_cfg_scale():
    # Single float, num_timesteps=5
    invocation = SD3DenoiseInvocation(cfg_scale=3.5)
    codeflash_output = invocation._prepare_cfg_scale(5); result = codeflash_output # 1.44μs -> 945ns (52.6% faster)

def test_basic_list_cfg_scale():
    # List of floats, num_timesteps matches length
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 989ns -> 851ns (16.2% faster)

def test_basic_float_cfg_scale_one_step():
    # Single float, num_timesteps=1
    invocation = SD3DenoiseInvocation(cfg_scale=7.5)
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 965ns -> 827ns (16.7% faster)

def test_basic_list_cfg_scale_one_step():
    # List with one float, num_timesteps=1
    invocation = SD3DenoiseInvocation(cfg_scale=[8.8])
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 1.01μs -> 772ns (30.6% faster)

# ---- Edge Test Cases ----

def test_edge_empty_list_cfg_scale():
    # Empty list, num_timesteps=0
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 931ns -> 774ns (20.3% faster)

def test_edge_list_length_mismatch_raises():
    # List length does not match num_timesteps, should raise AssertionError
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(3) # 1.53μs -> 1.39μs (10.1% faster)

def test_edge_invalid_type_cfg_scale_int():
    # cfg_scale is an int, should raise ValueError
    invocation = SD3DenoiseInvocation(cfg_scale=5)
    with pytest.raises(ValueError):
        invocation._prepare_cfg_scale(2)




def test_edge_cfg_scale_float_zero_timesteps():
    # Float cfg_scale, zero timesteps
    invocation = SD3DenoiseInvocation(cfg_scale=2.0)
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.13μs -> 910ns (24.1% faster)

def test_edge_cfg_scale_list_zero_timesteps():
    # List cfg_scale, zero timesteps, must be empty list
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 992ns -> 819ns (21.1% faster)

def test_edge_cfg_scale_list_empty_with_nonzero_timesteps():
    # Empty list, nonzero timesteps, should raise AssertionError
    invocation = SD3DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(1) # 1.48μs -> 1.40μs (5.58% faster)

# ---- Large Scale Test Cases ----

def test_large_float_cfg_scale():
    # Float cfg_scale, large num_timesteps
    invocation = SD3DenoiseInvocation(cfg_scale=1.5)
    num_timesteps = 999
    codeflash_output = invocation._prepare_cfg_scale(num_timesteps); result = codeflash_output # 1.50μs -> 1.37μs (9.64% faster)

def test_large_list_cfg_scale():
    # List cfg_scale, large num_timesteps
    num_timesteps = 999
    cfg_scale_list = [float(i) for i in range(num_timesteps)]
    invocation = SD3DenoiseInvocation(cfg_scale=cfg_scale_list)
    codeflash_output = invocation._prepare_cfg_scale(num_timesteps); result = codeflash_output # 1.08μs -> 812ns (33.0% faster)

def test_large_list_cfg_scale_length_mismatch():
    # List cfg_scale, length mismatch with num_timesteps, should raise AssertionError
    num_timesteps = 999
    cfg_scale_list = [float(i) for i in range(num_timesteps - 1)]
    invocation = SD3DenoiseInvocation(cfg_scale=cfg_scale_list)
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(num_timesteps) # 1.62μs -> 1.36μs (19.4% faster)


def test_cfg_scale_float_negative_timesteps():
    # Negative timesteps, should return empty list (since [x]*(-n) == [])
    invocation = SD3DenoiseInvocation(cfg_scale=2.2)
    codeflash_output = invocation._prepare_cfg_scale(-5); result = codeflash_output # 1.22μs -> 989ns (22.9% faster)

def test_cfg_scale_list_negative_timesteps():
    # Negative timesteps, should raise AssertionError (len(list) != -n)
    invocation = SD3DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(-2) # 1.55μs -> 1.44μs (7.28% faster)

def test_cfg_scale_float_zero():
    # cfg_scale is 0.0 float, should be repeated
    invocation = SD3DenoiseInvocation(cfg_scale=0.0)
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.02μs -> 938ns (9.06% faster)

def test_cfg_scale_list_of_zeros():
    # cfg_scale is list of zeros
    invocation = SD3DenoiseInvocation(cfg_scale=[0.0, 0.0, 0.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.00μs -> 802ns (25.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
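The comparison that the `codeflash_output` comment refers to can be sketched as follows. This is an assumption about the general idea, not Codeflash's implementation: `outcome`, `same_behavior`, `prepare_v1`, and `prepare_v2` are hypothetical names, and the two `prepare_*` functions are simplified free-function stand-ins for the method.

```python
def prepare_v1(cfg_scale, n):
    # Stand-in for the original control flow: assign, then return once.
    if isinstance(cfg_scale, float):
        result = [cfg_scale] * n
    elif isinstance(cfg_scale, list):
        assert len(cfg_scale) == n
        result = cfg_scale
    else:
        raise ValueError("Invalid CFG scale type")
    return result


def prepare_v2(cfg_scale, n):
    # Stand-in for the optimized control flow: early returns.
    if isinstance(cfg_scale, float):
        return [cfg_scale] * n
    if isinstance(cfg_scale, list):
        assert len(cfg_scale) == n
        return cfg_scale
    raise ValueError("Invalid CFG scale type")


def outcome(fn, *args):
    """Return ("ok", value) or ("raised", exception type) for one call."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # AssertionError is a subclass of Exception
        return ("raised", type(exc))


def same_behavior(original, optimized, cases) -> bool:
    # Both versions must agree on the return value or the raised exception type.
    return all(outcome(original, *c) == outcome(optimized, *c) for c in cases)


cases = [(3.5, 4), ([1.0, 2.0], 2), ([1.0, 2.0], 3), ([], -2), ("bad", 1)]
print(same_behavior(prepare_v1, prepare_v2, cases))  # True
```

Comparing exception types as well as return values matters here, because several of the generated tests exercise the `AssertionError` path.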

To edit these changes, run `git checkout codeflash/optimize-SD3DenoiseInvocation._prepare_cfg_scale-mhwt4z5q` and push.


codeflash-ai bot requested a review from mashraf-222 November 13, 2025 02:26

codeflash-ai bot added the labels `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) Nov 13, 2025
