@codeflash-ai codeflash-ai bot commented Dec 19, 2025

📄 15% (0.15x) speedup for nexttoken in unstructured/patches/pdfminer.py

⏱️ Runtime : 1.02 milliseconds → 886 microseconds (best of 138 runs)

📝 Explanation and details

The optimization achieves a 15% speedup (1.02 ms → 886 μs) by eliminating unnecessary logging overhead through a conditional log-level check.

Key optimization applied:

  • Conditional logging check: Replaced the unconditional log.debug("nexttoken: %r", token) call with if log.isEnabledFor(10): log.debug("nexttoken: %r", token) where 10 is the logging.DEBUG level constant.

Why this optimization works:
The line profiler results show that the original log.debug() call consumed 30.6% of total execution time (2.031μs out of 6.635μs). Even when debug logging is disabled, Python's logging system still performs expensive string formatting and method resolution. The isEnabledFor() check is a lightweight integer comparison that completely bypasses the costly debug call when logging is disabled.
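As a standalone illustration of the guard pattern described above (the logger name here is hypothetical; the real patch lives in unstructured/patches/pdfminer.py):

```python
import logging

log = logging.getLogger("demo")
log.setLevel(logging.WARNING)  # DEBUG is disabled, as in most production setups

token = (0, "keyword")

# Unguarded: log.debug() is always invoked; even though the record is
# ultimately dropped, the call and argument handling still cost time.
log.debug("nexttoken: %r", token)

# Guarded: isEnabledFor() is a cheap level comparison, so the debug call
# is skipped entirely when DEBUG (level 10) is disabled.
if log.isEnabledFor(logging.DEBUG):
    log.debug("nexttoken: %r", token)
```

Note that `log.debug()` performs the same level check internally, but only after the method call and its arguments have already been set up; the explicit guard avoids that fixed per-call overhead.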

Performance impact by test case:

  • Simple token access: 21-38% faster for basic token retrieval operations
  • Buffer filling scenarios: 8-11% faster when tokens need to be parsed
  • Large-scale operations: 24% faster when processing 1000+ tokens sequentially
  • Edge cases: Negligible impact (within ±6%) on exception-handling paths

Real-world benefits:
This optimization is particularly valuable because:

  1. Production environments typically run with logging disabled or at higher levels than DEBUG
  2. Token parsing is likely called frequently in PDF processing pipelines
  3. The speedup compounds when processing large documents with many tokens
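For example, under a typical production configuration pdfminer's debug records are disabled and the guard short-circuits (the logger names below are the standard pdfminer module names):

```python
import logging

# Typical production setup: pdfminer loggers capped at WARNING.
logging.getLogger("pdfminer").setLevel(logging.WARNING)

# Child loggers such as pdfminer.psparser inherit the effective level,
# so the isEnabledFor(DEBUG) guard in the patch evaluates to False.
enabled = logging.getLogger("pdfminer.psparser").isEnabledFor(logging.DEBUG)
print(enabled)  # → False
```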

The optimization preserves all original behavior while eliminating a significant performance bottleneck through standard Python logging best practices.
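Based on the interface exercised by the generated tests below, the patched function presumably follows this general shape (a hedged sketch, not the verbatim contents of unstructured/patches/pdfminer.py):

```python
import logging

log = logging.getLogger(__name__)

def nexttoken(self):
    # Parse until at least one (position, token) pair is buffered;
    # a PSEOF raised by _parse1 propagates to the caller.
    while not self._tokens:
        self.fillbuf()
        self.charpos = self._parse1(self.buf, self.charpos)
    token = self._tokens.pop(0)
    # The optimization: skip the debug call unless DEBUG (level 10) is enabled.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("nexttoken: %r", token)
    return token
```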

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests          🔘 None Found
🌀 Generated Regression Tests   2026 Passed
⏪ Replay Tests                 🔘 None Found
🔎 Concolic Coverage Tests      🔘 None Found
📊 Tests Coverage               71.4%
🌀 Generated Regression Tests and Runtime
# imports
import pytest

from unstructured.patches.pdfminer import nexttoken


# Dummy logger for testing (since pdfminer.psparser.log is used)
class DummyLogger:
    def debug(self, msg, *args):
        pass  # No-op for testing


log = DummyLogger()


# Custom exceptions and token class for testing
class PSEOF(Exception):
    pass


class PSBaseParserToken:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"PSBaseParserToken({self.value!r})"

    def __eq__(self, other):
        return isinstance(other, PSBaseParserToken) and self.value == other.value


# Dummy PSBaseParser for context (mimics interface needed for nexttoken)
class PSBaseParser:
    def nexttoken(self):
        pass


# Helper: Minimal parser class to allow injection of state for testing
class DummyParser:
    def __init__(
        self,
        tokens=None,
        eof=False,
        buf=b"",
        charpos=0,
        parse1_side_effect=None,
        fillbuf_side_effect=None,
    ):
        self._tokens = tokens if tokens is not None else []
        self.eof = eof
        self.buf = buf
        self.charpos = charpos
        self._parse1 = self._make_parse1(parse1_side_effect)
        self.fillbuf = self._make_fillbuf(fillbuf_side_effect)

    def _make_parse1(self, side_effect):
        def _parse1(buf, charpos):
            if callable(side_effect):
                return side_effect(buf, charpos)
            elif isinstance(side_effect, Exception):
                raise side_effect
            elif side_effect is not None:
                return side_effect
            return charpos

        return _parse1

    def _make_fillbuf(self, side_effect):
        def fillbuf():
            if callable(side_effect):
                return side_effect()
            elif isinstance(side_effect, Exception):
                raise side_effect

        return fillbuf


# ========== BASIC TEST CASES ==========


def test_returns_token_when_tokens_available():
    # Should return the first token when tokens are present
    parser = DummyParser(tokens=[(1, PSBaseParserToken("A")), (2, PSBaseParserToken("B"))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 958ns -> 708ns (35.3% faster)


def test_fills_tokens_when_empty():
    # Should call fillbuf and _parse1 to fill tokens if empty
    called = {}

    def fillbuf():
        called["fillbuf"] = True
        parser._tokens.append((123, PSBaseParserToken("new")))

    parser = DummyParser(tokens=[], buf=b"abc", charpos=0, fillbuf_side_effect=fillbuf)
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 2.08μs -> 1.88μs (11.1% faster)


def test_multiple_calls_return_tokens_in_order():
    parser = DummyParser(tokens=[(1, PSBaseParserToken("A")), (2, PSBaseParserToken("B"))])
    codeflash_output = nexttoken(parser)  # 791ns -> 583ns (35.7% faster)
    codeflash_output = nexttoken(parser)  # 375ns -> 292ns (28.4% faster)
    parser._tokens.append((3, PSBaseParserToken("C")))
    codeflash_output = nexttoken(parser)  # 375ns -> 333ns (12.6% faster)


# ========== EDGE TEST CASES ==========


def test_raises_pseof_when_no_tokens_and_parse1_hits_eof_and_no_tokens():
    # fillbuf does nothing, _parse1 raises PSEOF, and after that _tokens still empty
    def parse1(buf, charpos):
        raise PSEOF("Middle of token EOF")

    parser = DummyParser(tokens=[], buf=b"", charpos=0, parse1_side_effect=parse1)
    with pytest.raises(PSEOF):
        nexttoken(parser)  # 1.46μs -> 1.42μs (2.96% faster)


def test_token_is_popped_even_if_only_one():
    parser = DummyParser(tokens=[(5, PSBaseParserToken("last"))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 1.00μs -> 750ns (33.3% faster)


def test_fillbuf_and_parse1_called_until_token():
    # fillbuf and _parse1 called repeatedly until _tokens is filled
    call_count = {"fillbuf": 0, "parse1": 0}

    def fillbuf():
        call_count["fillbuf"] += 1
        if call_count["fillbuf"] == 3:
            parser._tokens.append((7, PSBaseParserToken("filled")))

    def parse1(buf, charpos):
        call_count["parse1"] += 1
        return charpos

    parser = DummyParser(tokens=[], fillbuf_side_effect=fillbuf, parse1_side_effect=parse1)
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 3.25μs -> 2.96μs (9.87% faster)


def test_many_tokens_processed_in_sequence():
    # Test with 1000 tokens
    tokens = [(i, PSBaseParserToken(f"T{i}")) for i in range(1000)]
    parser = DummyParser(tokens=tokens.copy())
    for i in range(1000):
        codeflash_output = nexttoken(parser)
        result = codeflash_output  # 319μs -> 256μs (24.7% faster)


def test_large_fillbuf_and_parse1_cycles():
    # Simulate the need to fill tokens 500 times before getting a token
    count = {"fillbuf": 0}

    def fillbuf():
        count["fillbuf"] += 1
        if count["fillbuf"] == 500:
            parser._tokens.append((42, PSBaseParserToken("big")))

    parser = DummyParser(tokens=[], fillbuf_side_effect=fillbuf)
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 131μs -> 130μs (0.961% faster)


def test_token_can_be_any_type():
    # Token values can be any type
    parser = DummyParser(
        tokens=[
            (1, PSBaseParserToken(123)),
            (2, PSBaseParserToken(None)),
            (3, PSBaseParserToken([1, 2, 3])),
        ]
    )
    codeflash_output = nexttoken(parser)  # 1.00μs -> 791ns (26.4% faster)
    codeflash_output = nexttoken(parser)  # 333ns -> 291ns (14.4% faster)
    codeflash_output = nexttoken(parser)  # 333ns -> 291ns (14.4% faster)


def test_parse1_and_fillbuf_can_be_noop_if_tokens_are_present():
    # If tokens are present, fillbuf and _parse1 should not be called
    called = {"fillbuf": False, "parse1": False}

    def fillbuf():
        called["fillbuf"] = True

    def parse1(buf, charpos):
        called["parse1"] = True
        return charpos

    parser = DummyParser(
        tokens=[(1, PSBaseParserToken("X"))], fillbuf_side_effect=fillbuf, parse1_side_effect=parse1
    )
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 750ns -> 542ns (38.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import pytest

from unstructured.patches.pdfminer import nexttoken


# Dummy implementations and helpers for test isolation
class DummyPSEOF(Exception):
    pass


class DummyLog:
    def __init__(self):
        self.last_debug = None

    def debug(self, msg, val):
        self.last_debug = (msg, val)


log = DummyLog()


class PSBaseParserToken:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, PSBaseParserToken):
            return self.value == other.value
        return False

    def __repr__(self):
        return f"PSBaseParserToken({repr(self.value)})"


# Dummy base parser for inheritance
class PSBaseParser:
    pass


# Helper: Dummy parser class for testing
class DummyParser(PSBaseParser):
    def __init__(
        self,
        tokens=None,
        eof=False,
        buf=None,
        charpos=0,
        parse1_side_effect=None,
        fillbuf_side_effect=None,
    ):
        self._tokens = tokens if tokens is not None else []
        self.eof = eof
        self.buf = buf
        self.charpos = charpos
        self._parse1_calls = []
        self._parse1_side_effect = parse1_side_effect
        self._fillbuf_side_effect = fillbuf_side_effect

    def fillbuf(self):
        if self._fillbuf_side_effect:
            return self._fillbuf_side_effect()
        # Default: do nothing
        return

    def _parse1(self, buf, charpos):
        self._parse1_calls.append((buf, charpos))
        if self._parse1_side_effect:
            return self._parse1_side_effect(buf, charpos)
        # Default: return next charpos
        return charpos + 1


# -------------------- UNIT TESTS --------------------

# 1. BASIC TEST CASES


def test_nexttoken_returns_token_when_available():
    # Token is available, should return it without further parsing.
    parser = DummyParser(tokens=[(1, PSBaseParserToken("tok1"))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 708ns -> 625ns (13.3% faster)


def test_nexttoken_returns_first_token_if_multiple():
    # Multiple tokens, should return the first one and leave the rest.
    parser = DummyParser(tokens=[(1, PSBaseParserToken("tok1")), (2, PSBaseParserToken("tok2"))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 708ns -> 584ns (21.2% faster)


def test_nexttoken_fills_buffer_and_parses_if_no_tokens():
    # No tokens, fillbuf and _parse1 should be called, then a token appears.
    def fillbuf():
        parser._tokens.append((42, PSBaseParserToken("tokX")))

    parser = DummyParser(tokens=[], buf=b"abc", charpos=0)
    parser.fillbuf = fillbuf
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 1.67μs -> 1.67μs (0.060% faster)


def test_nexttoken_multiple_empty_then_token():
    # fillbuf called twice before token appears
    calls = []

    def fillbuf():
        calls.append("fillbuf")
        if len(calls) == 2:
            parser._tokens.append((99, PSBaseParserToken("tokY")))

    parser = DummyParser(tokens=[])
    parser.fillbuf = fillbuf
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 2.21μs -> 2.04μs (8.13% faster)


# 2. EDGE TEST CASES


def test_nexttoken_handles_PSEOF_in_parse1_and_raises_if_no_token():
    # If _parse1 raises DummyPSEOF and no tokens are produced, should raise
    def parse1_side_effect(buf, charpos):
        raise DummyPSEOF("mid-token EOF")

    parser = DummyParser(tokens=[], buf=b"abc", charpos=0, parse1_side_effect=parse1_side_effect)
    # fillbuf should do nothing, so _tokens remains empty
    with pytest.raises(DummyPSEOF):
        nexttoken(parser)  # 1.42μs -> 1.50μs (5.60% slower)


def test_nexttoken_token_is_tuple_with_expected_types():
    # The returned token should be a tuple of (int, PSBaseParserToken)
    parser = DummyParser(tokens=[(5, PSBaseParserToken("tok5"))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 1.00μs -> 791ns (26.4% faster)


def test_nexttoken_does_not_consume_token_on_eof_exception():
    # If PSEOF is raised, _tokens should remain unmodified
    def parse1_side_effect(buf, charpos):
        raise DummyPSEOF("fail")

    parser = DummyParser(tokens=[], buf=b"abc", charpos=0, parse1_side_effect=parse1_side_effect)
    with pytest.raises(DummyPSEOF):
        nexttoken(parser)  # 1.33μs -> 1.33μs (0.000% faster)


# 3. LARGE SCALE TEST CASES


def test_nexttoken_large_number_of_tokens():
    # Should handle 1000 tokens, returning them in order
    tokens = [(i, PSBaseParserToken(f"tok{i}")) for i in range(1000)]
    parser = DummyParser(tokens=list(tokens))
    for i in range(1000):
        codeflash_output = nexttoken(parser)
        result = codeflash_output  # 319μs -> 256μs (24.4% faster)


def test_nexttoken_large_number_of_fillbuf_calls():
    # Should handle 1000 fillbuf calls before a token appears
    call_count = [0]

    def fillbuf():
        call_count[0] += 1
        if call_count[0] == 1000:
            parser._tokens.append((123, PSBaseParserToken("final")))

    parser = DummyParser(tokens=[])
    parser.fillbuf = fillbuf
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 215μs -> 216μs (0.327% slower)


def test_nexttoken_large_token_object():
    # Token with a large payload
    large_value = "A" * 10000
    parser = DummyParser(tokens=[(7, PSBaseParserToken(large_value))])
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 708ns -> 583ns (21.4% faster)


def test_nexttoken_large_buf_parsing():
    # Large buf passed to _parse1
    buf = b"B" * 999
    called = []

    def parse1_side_effect(b, c):
        called.append((b, c))
        parser._tokens.append((1, PSBaseParserToken("bigbuf")))
        return c + 1

    parser = DummyParser(tokens=[], buf=buf, charpos=0, parse1_side_effect=parse1_side_effect)
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 1.88μs -> 1.83μs (2.24% faster)


# 4. ADDITIONAL EDGE CASES


def test_nexttoken_token_is_removed_from_tokens_list():
    # After returning, token is removed from _tokens
    parser = DummyParser(tokens=[(1, PSBaseParserToken("tok1")), (2, PSBaseParserToken("tok2"))])
    codeflash_output = nexttoken(parser)
    t1 = codeflash_output  # 708ns -> 583ns (21.4% faster)
    codeflash_output = nexttoken(parser)
    t2 = codeflash_output  # 375ns -> 333ns (12.6% faster)


def test_nexttoken_multiple_rounds_until_token():
    # Several fillbuf/_parse1 rounds before token appears
    rounds = [0]

    def fillbuf():
        rounds[0] += 1
        if rounds[0] == 5:
            parser._tokens.append((50, PSBaseParserToken("fifth")))

    parser = DummyParser(tokens=[])
    parser.fillbuf = fillbuf
    codeflash_output = nexttoken(parser)
    result = codeflash_output  # 3.50μs -> 3.25μs (7.69% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-nexttoken-mjcgqs8l` and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 19, 2025 06:03
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 19, 2025