Conversation

@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 27% (0.27x) speedup for WatermarkDecoder.reconstruct in invokeai/backend/image_util/imwatermark/vendor.py

⏱️ Runtime : 1.22 milliseconds → 957 microseconds (best of 119 runs)

📝 Explanation and details

The optimized code achieves a 27% speedup by eliminating inefficient byte concatenation and string operations in three key methods:

What optimizations were applied:

  1. reconstruct_ipv4: Replaced the list comprehension that stringified each octet ([str(ip) for ip in list(np.packbits(bits))]) with direct .format() string formatting using indexed array access. This avoids creating an intermediate list and multiple string conversions.

  2. reconstruct_uuid: Eliminated the expensive loop that repeatedly concatenates bytes (bstr += struct.pack(">B", nums[i])) and replaced it with a single bytes(nums[:16]) call. This removes Python-level iteration and repeated immutable bytes object creation.

  3. reconstruct_bytes: Replaced the loop-based byte concatenation pattern with direct slicing and the bytes() constructor (bytes(nums[:end_idx])), eliminating the expensive repeated concatenation of immutable bytes objects. A minimal before/after sketch of all three patterns follows this list.
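
For reference, here is a minimal before/after sketch of the three patterns described above. It is reconstructed from the snippets quoted in this summary, not copied from the vendored file, so helper names (nums, length) and the surrounding class structure are assumptions:

```python
import struct
import uuid

import numpy as np


# Original pattern (simplified): repeated concatenation of immutable bytes objects.
def reconstruct_bytes_original(bits, length):
    nums = np.packbits(bits)
    bstr = b""
    for i in range(length // 8):
        bstr += struct.pack(">B", nums[i])  # allocates a new bytes object every iteration
    return bstr


def reconstruct_uuid_original(bits):
    nums = np.packbits(bits)
    bstr = b""
    for i in range(16):
        bstr += struct.pack(">B", nums[i])
    return str(uuid.UUID(bytes=bstr))


def reconstruct_ipv4_original(bits):
    nums = np.packbits(bits)
    return ".".join([str(ip) for ip in list(nums)])  # intermediate list + per-octet str()


# Optimized pattern: slice once and build the result in a single call.
def reconstruct_bytes_optimized(bits, length):
    nums = np.packbits(bits)
    return bytes(nums[: length // 8])  # one allocation straight from the uint8 buffer


def reconstruct_uuid_optimized(bits):
    nums = np.packbits(bits)
    return str(uuid.UUID(bytes=bytes(nums[:16])))  # no Python-level loop


def reconstruct_ipv4_optimized(bits):
    nums = np.packbits(bits)
    return "{}.{}.{}.{}".format(nums[0], nums[1], nums[2], nums[3])  # indexed access, no list
```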

Why these optimizations are faster:

  • Avoided repeated concatenation: The original code used bstr += ... in loops, which creates a new bytes object on each iteration due to immutability, resulting in O(n²) byte copying overall (see the micro-benchmark sketch after this list)
  • Reduced Python overhead: Eliminated explicit loops in favor of vectorized NumPy operations and built-in constructors
  • Minimized intermediate objects: Direct array slicing and constructor calls reduce temporary object creation
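
To make the first point concrete, here is a small standalone micro-benchmark (illustrative only and not part of this PR; sizes and names are arbitrary):

```python
import timeit

import numpy as np

# 1000 bytes of packed watermark-like payload.
nums = np.packbits(np.random.randint(0, 2, size=8000).astype(np.uint8))


def concat_loop():
    bstr = b""
    for i in range(len(nums)):
        bstr += bytes([nums[i]])  # copies the whole accumulated buffer on every iteration
    return bstr


def single_call():
    return bytes(nums)  # one allocation directly from the uint8 buffer


# On CPython the loop is typically far slower, and the gap grows with payload size.
print("concat loop :", timeit.timeit(concat_loop, number=200))
print("single call :", timeit.timeit(single_call, number=200))
```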

Performance impact by test type:

  • UUID operations: Show the largest gains (15-26% faster) due to eliminating the 16-iteration concatenation loop
  • Large-scale operations: Benefit significantly, with the large bytes test showing 52% improvement due to reduced loop overhead
  • IPv4 operations: Consistent 8-21% improvements from avoiding list creation and string conversions
  • Simple cases: Still benefit from reduced allocation overhead, showing 5-19% gains

The optimization is particularly effective for watermark decoding workloads that process multiple or large watermarks, as the byte manipulation operations are core to the decoding process.
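
For context, this is the decode path being optimized, as exercised by the generated tests below; the payload here is synthetic (a real workload would obtain the bits from the watermark extraction step):

```python
import numpy as np

from invokeai.backend.image_util.imwatermark.vendor import WatermarkDecoder

# Pack a known payload into a bit list, then reconstruct it through the decoder.
payload = b"example-watermark"  # arbitrary example payload
bits = list(np.unpackbits(np.frombuffer(payload, dtype=np.uint8)))

decoder = WatermarkDecoder(wm_type="bytes", length=len(bits))  # length is given in bits
assert decoder.reconstruct(bits) == payload
```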

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 292 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import base64
import struct
import uuid

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from invokeai.backend.image_util.imwatermark.vendor import WatermarkDecoder

# unit tests

# --- BASIC TEST CASES ---

def test_reconstruct_bytes_basic():
    # Test reconstructing bytes from bits
    decoder = WatermarkDecoder(wm_type="bytes", length=16)
    bits = [0,0,0,0,0,0,0,1, 0,0,0,0,0,0,1,0]  # Should produce b'\x01\x02'
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 9.10μs -> 8.21μs (10.9% faster)

def test_reconstruct_bits_basic():
    # Test reconstructing bits (identity)
    decoder = WatermarkDecoder(wm_type="bits", length=8)
    bits = [1,0,1,0,1,0,1,0]
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 902ns -> 861ns (4.76% faster)

def test_reconstruct_ipv4_basic():
    # Test reconstructing IPv4 from bits
    decoder = WatermarkDecoder(wm_type="ipv4")
    # 32 bits representing 4 bytes: 192, 168, 1, 1
    bits = list(np.unpackbits(np.array([192,168,1,1], dtype=np.uint8)))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 7.20μs -> 6.61μs (8.84% faster)

def test_reconstruct_uuid_basic():
    # Test reconstructing UUID from bits
    decoder = WatermarkDecoder(wm_type="uuid")
    # UUID: 12345678-1234-5678-1234-567812345678
    uuid_bytes = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes
    bits = list(np.unpackbits(np.frombuffer(uuid_bytes, dtype=np.uint8)))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 17.1μs -> 13.8μs (23.7% faster)

def test_reconstruct_b16_basic():
    # Test reconstructing base16 from bits
    decoder = WatermarkDecoder(wm_type="b16", length=16)
    bits = [0,0,0,0,0,0,0,1, 0,0,0,0,0,0,1,0]  # Should produce b'\x01\x02' -> b'0102'
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 10.2μs -> 9.69μs (5.79% faster)

# --- EDGE TEST CASES ---

def test_reconstruct_bytes_all_zeros():
    # All bits zero
    decoder = WatermarkDecoder(wm_type="bytes", length=8)
    bits = [0]*8
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 7.34μs -> 6.94μs (5.69% faster)

def test_reconstruct_bytes_all_ones():
    # All bits one
    decoder = WatermarkDecoder(wm_type="bytes", length=8)
    bits = [1]*8
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 7.21μs -> 6.30μs (14.5% faster)

def test_reconstruct_ipv4_edge_values():
    # Edge IPv4 values: 0.255.127.128
    decoder = WatermarkDecoder(wm_type="ipv4")
    bits = list(np.unpackbits(np.array([0,255,127,128], dtype=np.uint8)))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 6.79μs -> 6.69μs (1.53% faster)

def test_reconstruct_uuid_edge():
    # Edge UUID: all zeros
    decoder = WatermarkDecoder(wm_type="uuid")
    bits = [0]*128
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 20.1μs -> 16.8μs (19.7% faster)

def test_reconstruct_b16_empty():
    # Empty bits for b16
    decoder = WatermarkDecoder(wm_type="b16", length=0)
    bits = []
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output

def test_reconstruct_invalid_length_raises():
    # Length mismatch should raise
    decoder = WatermarkDecoder(wm_type="bytes", length=8)
    bits = [0]*7  # Only 7 bits, should be 8
    with pytest.raises(RuntimeError):
        decoder.reconstruct(bits) # 1.16μs -> 1.17μs (1.37% slower)

def test_reconstruct_unsupported_type_raises():
    # Unsupported type should raise NameError on construction
    with pytest.raises(NameError):
        WatermarkDecoder(wm_type="unsupported", length=8)

def test_reconstruct_bytes_not_multiple_of_8():
    # Length not multiple of 8 for bytes
    decoder = WatermarkDecoder(wm_type="bytes", length=9)
    bits = [1]*9
    # Should reconstruct only 1 byte (first 8 bits), ignore last bit
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 15.0μs -> 13.6μs (9.86% faster)

def test_reconstruct_b16_not_multiple_of_8():
    # Length not multiple of 8 for b16
    decoder = WatermarkDecoder(wm_type="b16", length=9)
    bits = [1]*9
    # Should reconstruct only 1 byte, encoded in base16
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 11.4μs -> 10.6μs (6.90% faster)

# --- LARGE SCALE TEST CASES ---

def test_reconstruct_bytes_large_scale():
    # Large scale test for bytes
    length = 1024
    decoder = WatermarkDecoder(wm_type="bytes", length=length)
    # Create bits for sequential bytes 0,1,2,...,127 (1024 bits = 128 bytes)
    bytes_arr = np.arange(128, dtype=np.uint8)
    bits = list(np.tile(np.unpackbits(bytes_arr), length // 1024))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output
    expected_bytes = bytes_arr.tobytes()

def test_reconstruct_uuid_large_scale():
    # Large scale test for multiple UUIDs (simulate by repeating bits)
    decoder = WatermarkDecoder(wm_type="uuid")
    uuid_val = uuid.UUID("ffffffff-ffff-ffff-ffff-ffffffffffff")
    bits = list(np.unpackbits(np.frombuffer(uuid_val.bytes, dtype=np.uint8)))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 19.8μs -> 17.2μs (14.8% faster)

def test_reconstruct_bits_large_scale():
    # Large scale test for bits
    length = 1000
    decoder = WatermarkDecoder(wm_type="bits", length=length)
    bits = [i%2 for i in range(length)]  # Alternating bits
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 1.13μs -> 1.16μs (2.58% slower)

def test_reconstruct_b16_large_scale():
    # Large scale test for b16
    length = 800
    decoder = WatermarkDecoder(wm_type="b16", length=length)
    # 800 bits = 100 bytes
    bits = [1 if i%2==0 else 0 for i in range(length)]
    bytes_arr = np.packbits(bits)
    expected_b16 = base64.b16encode(bytes(bytes_arr[:length//8]))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 42.4μs -> 27.8μs (52.6% faster)

def test_reconstruct_ipv4_large_scale():
    # Large scale for IPv4: test with random IPv4 address
    decoder = WatermarkDecoder(wm_type="ipv4")
    # 32 bits for IPv4, e.g. 8, 16, 32, 64
    arr = np.array([8,16,32,64], dtype=np.uint8)
    bits = list(np.unpackbits(arr))
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 7.31μs -> 7.63μs (4.23% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import base64
import struct
import uuid

import numpy as np
# imports
import pytest  # used for our unit tests
from invokeai.backend.image_util.imwatermark.vendor import WatermarkDecoder

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_reconstruct_bytes_basic():
    # Test reconstructing 2 bytes from bits
    data = [1,0,1,0, 0,0,0,1, 1,1,1,1, 0,0,0,0]  # 16 bits: 0b10100001 0b11110000
    decoder = WatermarkDecoder(wm_type="bytes", length=16)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 14.2μs -> 12.9μs (10.2% faster)

def test_reconstruct_bits_basic():
    # Test reconstructing bits returns the same bits
    data = [0,1,1,0,1,0,1,0]
    decoder = WatermarkDecoder(wm_type="bits", length=8)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 972ns -> 945ns (2.86% faster)

def test_reconstruct_ipv4_basic():
    # 32 bits, each byte is an octet of the IPv4 address
    # e.g. 192.168.1.1 = [11000000, 10101000, 00000001, 00000001]
    bits = []
    for octet in [192, 168, 1, 1]:
        bits.extend([(octet >> i) & 1 for i in reversed(range(8))])
    decoder = WatermarkDecoder(wm_type="ipv4")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 15.4μs -> 13.0μs (18.3% faster)

def test_reconstruct_uuid_basic():
    # 128 bits for UUID: use a known UUID
    u = uuid.UUID('12345678-1234-5678-1234-567812345678')
    bits = []
    for b in u.bytes:
        bits.extend([(b >> i) & 1 for i in reversed(range(8))])
    decoder = WatermarkDecoder(wm_type="uuid")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 21.5μs -> 18.4μs (17.3% faster)

def test_reconstruct_b16_basic():
    # 16 bits, should return base16 encoding of the bytes
    data = [1,0,1,0, 0,0,0,1, 1,1,1,1, 0,0,0,0]  # 0xA1F0
    decoder = WatermarkDecoder(wm_type="b16", length=16)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 11.2μs -> 9.44μs (19.2% faster)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_reconstruct_bytes_zero_length():
    # Test zero-length bytes
    decoder = WatermarkDecoder(wm_type="bytes", length=0)
    codeflash_output = decoder.reconstruct([]); result = codeflash_output

def test_reconstruct_bits_empty():
    # Test zero-length bits
    decoder = WatermarkDecoder(wm_type="bits", length=0)
    codeflash_output = decoder.reconstruct([]); result = codeflash_output # 1.11μs -> 1.26μs (11.5% slower)

def test_reconstruct_b16_empty():
    # Test zero-length b16
    decoder = WatermarkDecoder(wm_type="b16", length=0)
    codeflash_output = decoder.reconstruct([]); result = codeflash_output

def test_reconstruct_bytes_non_byte_aligned():
    # Length not a multiple of 8: only the complete bytes are reconstructed (none here)
    data = [1,1,1,1, 0,0,0]  # 7 bits, not a full byte
    decoder = WatermarkDecoder(wm_type="bytes", length=7)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output

def test_reconstruct_ipv4_all_zeros():
    # All zeros: 0.0.0.0
    bits = [0]*32
    decoder = WatermarkDecoder(wm_type="ipv4")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 18.3μs -> 15.8μs (15.9% faster)

def test_reconstruct_ipv4_all_ones():
    # All ones: 255.255.255.255
    bits = [1]*32
    decoder = WatermarkDecoder(wm_type="ipv4")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 11.5μs -> 9.50μs (21.5% faster)

def test_reconstruct_uuid_all_zeros():
    # All zeros UUID
    bits = [0]*128
    decoder = WatermarkDecoder(wm_type="uuid")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 23.0μs -> 19.8μs (15.7% faster)

def test_reconstruct_uuid_all_ones():
    # All ones UUID (0xffff...ffff)
    bits = [1]*128
    decoder = WatermarkDecoder(wm_type="uuid")
    codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 19.1μs -> 16.4μs (16.9% faster)

def test_reconstruct_invalid_length():
    # Test error if bits length doesn't match expected
    decoder = WatermarkDecoder(wm_type="bytes", length=8)
    with pytest.raises(RuntimeError):
        decoder.reconstruct([1,0,1]) # 981ns -> 975ns (0.615% faster)

def test_unsupported_type():
    # Test unsupported watermark type
    with pytest.raises(NameError):
        WatermarkDecoder(wm_type="unknown")

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_reconstruct_bytes_large():
    # Test reconstructing 1000 bytes (8000 bits)
    data = []
    # Let's make a pattern: 0x55, 0xAA, alternating
    for i in range(1000):
        val = 0x55 if i % 2 == 0 else 0xAA
        data.extend([(val >> j) & 1 for j in reversed(range(8))])
    decoder = WatermarkDecoder(wm_type="bytes", length=8000)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 393μs -> 258μs (52.4% faster)
    expected = b''.join([b'\x55' if i % 2 == 0 else b'\xAA' for i in range(1000)])

def test_reconstruct_bits_large():
    # Test reconstructing 1000 bits
    data = [i % 2 for i in range(1000)]  # 0,1,0,1,...
    decoder = WatermarkDecoder(wm_type="bits", length=1000)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 1.06μs -> 1.14μs (6.68% slower)

def test_reconstruct_b16_large():
    # Test reconstructing 128 bits (16 bytes) and encoding as base16
    data = []
    for i in range(16):
        val = i
        data.extend([(val >> j) & 1 for j in reversed(range(8))])
    decoder = WatermarkDecoder(wm_type="b16", length=128)
    codeflash_output = decoder.reconstruct(data); result = codeflash_output # 20.1μs -> 16.6μs (21.6% faster)
    expected_bytes = bytes(range(16))

def test_reconstruct_uuid_large_unique():
    # Test reconstructing 10 different UUIDs
    for i in range(10):
        u = uuid.uuid4()
        bits = []
        for b in u.bytes:
            bits.extend([(b >> j) & 1 for j in reversed(range(8))])
        decoder = WatermarkDecoder(wm_type="uuid")
        codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 105μs -> 83.6μs (26.3% faster)

def test_reconstruct_ipv4_many():
    # Test reconstructing 100 different IPv4 addresses
    for i in range(100):
        octets = [(i*3)%256, (i*5)%256, (i*7)%256, (i*11)%256]
        bits = []
        for octet in octets:
            bits.extend([(octet >> j) & 1 for j in reversed(range(8))])
        decoder = WatermarkDecoder(wm_type="ipv4")
        codeflash_output = decoder.reconstruct(bits); result = codeflash_output # 373μs -> 323μs (15.4% faster)
        expected = '.'.join(str(o) for o in octets)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-WatermarkDecoder.reconstruct-mhwwpz39` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 04:06
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Nov 13, 2025