Raise a clear error for empty token lists in bad_words_ids / sequence_bias #46994

Open
Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:E-002-nobadwords-empty
Sunt-ing commented Jul 1, 2026

Contributor

What does this PR do?

NoBadWordsLogitsProcessor and the list form of SequenceBiasLogitsProcessor currently accept an empty token sequence and then fail later with IndexError: tuple index out of range when generation tries to read the last token of that sequence. This is reachable through the documented bad_words_ids pattern because tokenizing an empty string with add_special_tokens=False returns [].

This PR rejects empty token sequences at construction time for both entry points. The dict form of sequence_bias already rejects empty sequences; this makes bad_words_ids and the list form of sequence_bias follow the same contract, while non-empty entries keep their existing behavior.
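The failure mode and the construction-time check described above can be sketched in isolation as follows. This is a minimal illustration, not the code in this PR: `validate_bad_words_ids` and `last_token` are hypothetical names, and only the wording of the inner-list error message is taken from the output shown in this description.

```python
def last_token(sequence):
    # Mirrors the late failure: reading the last token of an empty sequence.
    return sequence[-1]


def validate_bad_words_ids(bad_words_ids):
    """Hypothetical stand-in for a construction-time check (not the library's code)."""
    if not isinstance(bad_words_ids, list) or len(bad_words_ids) == 0:
        raise ValueError(f"`bad_words_ids` has to be a non-empty list, but is {bad_words_ids}.")
    # Reject empty inner sequences explicitly instead of letting them reach generation.
    if any(not isinstance(seq, list) or len(seq) == 0 for seq in bad_words_ids):
        raise ValueError(f"`bad_words_ids` has to be a list of non-empty lists, but is {bad_words_ids}.")
    # Non-empty entries pass through unchanged (stored as tuples in this sketch).
    return [tuple(seq) for seq in bad_words_ids]


if __name__ == "__main__":
    try:
        last_token(tuple([]))
    except IndexError as exc:
        print(f"late failure:  {type(exc).__name__}: {exc}")

    try:
        validate_bad_words_ids([[], [3]])
    except ValueError as exc:
        print(f"early failure: {type(exc).__name__}: {exc}")

    print(validate_bad_words_ids([[3, 4]]))
```

Failing in the constructor keeps the error next to the bad input, rather than surfacing it several calls deep inside generation.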

Reproduction and before/after output (CPU)

Run from the repo root:

CUDA_VISIBLE_DEVICES="" PYTHONPATH=src:. python - <<'PY'
import importlib
import sys

import torch

sys.path.insert(0, ".")

from tests.causal_lm_tester import CausalLMModelTester
from transformers import set_seed
from transformers.generation.logits_process import SequenceBiasLogitsProcessor

torch.set_grad_enabled(False)


def find_tester(model_name):
    module = importlib.import_module(f"tests.models.{model_name}.test_modeling_{model_name}")
    for name in dir(module):
        obj = getattr(module, name)
        if isinstance(obj, type) and issubclass(obj, CausalLMModelTester) and obj is not CausalLMModelTester:
            return obj
    raise RuntimeError("no tester")


def build():
    tester_cls = find_tester("llama")
    tester = tester_cls(parent=None)
    config = tester.get_config()
    config.use_cache = True
    set_seed(0)
    model = tester.causal_lm_class(config).to(torch.float32).eval()
    model.config._attn_implementation = "eager"
    model.generation_config.pad_token_id = 0
    model.generation_config.eos_token_id = 2
    vocab_size = config.get_text_config().vocab_size
    set_seed(4)
    input_ids = torch.randint(3, vocab_size, (1, 8))
    return model, input_ids


def trial(label, bad_words_ids):
    model, input_ids = build()
    try:
        model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            do_sample=False,
            max_new_tokens=12,
            use_cache=True,
            bad_words_ids=bad_words_ids,
        )
        print(f"{label:38} OK")
    except Exception as exc:
        print(f"{label:38} {type(exc).__name__}: {str(exc)[:80]}")


trial("BUG  bad_words_ids=[[]]", [[]])
trial("CTRL bad_words_ids=[[3,4]]", [[3, 4]])
trial("BUG  empty via tokenize ['', 'x']", [[], [3]])

try:
    SequenceBiasLogitsProcessor([[[], 1.0]])(torch.tensor([[5, 6, 7]]), torch.randn(1, 50))
    print(f"{'BUG  sequence_bias=[[[],1.0]]':38} OK")
except Exception as exc:
    print(f"{'BUG  sequence_bias=[[[],1.0]]':38} {type(exc).__name__}: {str(exc)[:80]}")
PY

Before:

BUG  bad_words_ids=[[]]              IndexError: tuple index out of range
CTRL bad_words_ids=[[3,4]]           OK
BUG  empty via tokenize ['', 'x']     IndexError: tuple index out of range
BUG  sequence_bias=[[[],1.0]]         IndexError: tuple index out of range

After:

BUG  bad_words_ids=[[]]              ValueError: `bad_words_ids` has to be a list of non-empty lists, but is [[]].
CTRL bad_words_ids=[[3,4]]           OK
BUG  empty via tokenize ['', 'x']     ValueError: `bad_words_ids` has to be a list of non-empty lists, but is [[], [3]].
BUG  sequence_bias=[[[],1.0]]         ValueError: Each element in `sequence_bias` has to be a non-empty list of lists of positive integers and float, but is [[[], 1.0]].

A focused regression test is added in tests/generation/test_logits_process.py for bad_words_ids=[[]], mixed empty and non-empty bad_words_ids, and sequence_bias=[[[], 1.0]]. It fails on main because no ValueError is raised at construction time, and passes with this change.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline and the Pull Request checks?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes according to the guidelines?
  • Did you write any new necessary tests?

Who can review?

@Cyrilvallez

…_bias

NoBadWordsLogitsProcessor and the list form of SequenceBiasLogitsProcessor accept
an empty inner token list and only fail later, during generation, with an opaque
IndexError instead of a clear validation error at construction time. The empty
list is reachable from the documented usage, since some words tokenize to an empty
list (e.g. tokenizer("", add_special_tokens=False).input_ids == []).

The dict form of sequence_bias already rejects empty sequences with a clear
ValueError; this adds the same check to the two paths that missed it (the vacuous
all/any over an empty list let it slip through), so an empty token list now raises
a clear ValueError at construction.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
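
The "vacuous all/any" point in the commit message can be seen with plain Python, independent of the library. The two helper functions below are hypothetical and only illustrate the pattern: a per-token check built on `all(...)` accepts an empty list because there is no token that can fail it, so an explicit length check is needed.

```python
def tokens_look_valid_without_length_check(seq):
    # Hypothetical per-token check: every token is a non-negative int.
    # For an empty list there is nothing to check, so all(...) returns True.
    return all(isinstance(tok, int) and tok >= 0 for tok in seq)


def tokens_look_valid_with_length_check(seq):
    # Same check, but an empty sequence is rejected up front.
    return len(seq) > 0 and tokens_look_valid_without_length_check(seq)


if __name__ == "__main__":
    print(all(isinstance(tok, int) for tok in []))   # vacuously True
    print(any(tok < 0 for tok in []))                # vacuously False
    print(tokens_look_valid_without_length_check([]))
    print(tokens_look_valid_with_length_check([]))
    print(tokens_look_valid_with_length_check([3, 4]))
```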