Raise a clear error for empty token lists in bad_words_ids / sequence_bias by Sunt-ing · Pull Request #46994 · huggingface/transformers

Sunt-ing · 2026-07-01T06:58:38Z

What does this PR do?

NoBadWordsLogitsProcessor and the list form of SequenceBiasLogitsProcessor currently accept an empty token sequence and then fail later with IndexError: tuple index out of range when generation tries to read the last token of that sequence. This is reachable through the documented bad_words_ids pattern because tokenizing an empty string with add_special_tokens=False returns [].
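For readers who want to see the failure mode in isolation, here is a minimal sketch that does not use transformers. `last_token` and `try_last_token` are hypothetical helpers that only stand in for "read the last token of that sequence"; they are not the library's code.

```python
def last_token(sequence):
    # Hypothetical stand-in: convert to a tuple and read the final element,
    # which is the operation that fails when the sequence is empty.
    return tuple(sequence)[-1]


def try_last_token(sequence):
    # Return the last token, or a short description of the error raised.
    try:
        return last_token(sequence)
    except IndexError as exc:
        return f"IndexError: {exc}"


print(try_last_token([3, 4]))  # 4
print(try_last_token([]))      # IndexError: tuple index out of range
```

The empty case produces the same message as the traceback above, which is why the error gives no hint that the input was at fault.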

This PR rejects empty token sequences at construction time for both entry points. The dict form of sequence_bias already rejects empty sequences; this makes bad_words_ids and the list form of sequence_bias follow the same contract, while non-empty entries keep their existing behavior.
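As a rough illustration of the construction-time contract, the sketch below rejects any empty inner list before generation starts. `validate_bad_words_ids` is a hypothetical standalone function, not the code in this PR; only the error message is taken from the "After" output further down.

```python
def validate_bad_words_ids(bad_words_ids):
    # Hypothetical validator: the outer value must be a non-empty list,
    # and every inner entry must itself be a non-empty list.
    message = f"`bad_words_ids` has to be a list of non-empty lists, but is {bad_words_ids}."
    if not isinstance(bad_words_ids, list) or len(bad_words_ids) == 0:
        raise ValueError(message)
    for sequence in bad_words_ids:
        # The explicit length check is what catches [] here.
        if not isinstance(sequence, list) or len(sequence) == 0:
            raise ValueError(message)
    return bad_words_ids


validate_bad_words_ids([[3, 4]])       # accepted, existing behavior unchanged
try:
    validate_bad_words_ids([[], [3]])  # rejected up front
except ValueError as exc:
    print(exc)
```

A similar length check would presumably apply to each token list in the list form of sequence_bias.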

Reproduction and before/after output (CPU)

Run from the repo root:

CUDA_VISIBLE_DEVICES="" PYTHONPATH=src:. python - <<'PY'
import importlib
import sys

import torch

sys.path.insert(0, ".")

from tests.causal_lm_tester import CausalLMModelTester
from transformers import set_seed
from transformers.generation.logits_process import SequenceBiasLogitsProcessor

torch.set_grad_enabled(False)


def find_tester(model_name):
    module = importlib.import_module(f"tests.models.{model_name}.test_modeling_{model_name}")
    for name in dir(module):
        obj = getattr(module, name)
        if isinstance(obj, type) and issubclass(obj, CausalLMModelTester) and obj is not CausalLMModelTester:
            return obj
    raise RuntimeError("no tester")


def build():
    tester_cls = find_tester("llama")
    tester = tester_cls(parent=None)
    config = tester.get_config()
    config.use_cache = True
    set_seed(0)
    model = tester.causal_lm_class(config).to(torch.float32).eval()
    model.config._attn_implementation = "eager"
    model.generation_config.pad_token_id = 0
    model.generation_config.eos_token_id = 2
    vocab_size = config.get_text_config().vocab_size
    set_seed(4)
    input_ids = torch.randint(3, vocab_size, (1, 8))
    return model, input_ids


def trial(label, bad_words_ids):
    model, input_ids = build()
    try:
        model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            do_sample=False,
            max_new_tokens=12,
            use_cache=True,
            bad_words_ids=bad_words_ids,
        )
        print(f"{label:38} OK")
    except Exception as exc:
        print(f"{label:38} {type(exc).__name__}: {str(exc)[:80]}")


trial("BUG  bad_words_ids=[[]]", [[]])
trial("CTRL bad_words_ids=[[3,4]]", [[3, 4]])
trial("BUG  empty via tokenize ['', 'x']", [[], [3]])

try:
    SequenceBiasLogitsProcessor([[[], 1.0]])(torch.tensor([[5, 6, 7]]), torch.randn(1, 50))
    print(f"{'BUG  sequence_bias=[[[],1.0]]':38} OK")
except Exception as exc:
    print(f"{'BUG  sequence_bias=[[[],1.0]]':38} {type(exc).__name__}: {str(exc)[:80]}")
PY

Before:

BUG  bad_words_ids=[[]]              IndexError: tuple index out of range
CTRL bad_words_ids=[[3,4]]           OK
BUG  empty via tokenize ['', 'x']     IndexError: tuple index out of range
BUG  sequence_bias=[[[],1.0]]         IndexError: tuple index out of range

After:

BUG  bad_words_ids=[[]]              ValueError: `bad_words_ids` has to be a list of non-empty lists, but is [[]].
CTRL bad_words_ids=[[3,4]]           OK
BUG  empty via tokenize ['', 'x']     ValueError: `bad_words_ids` has to be a list of non-empty lists, but is [[], [3]].
BUG  sequence_bias=[[[],1.0]]         ValueError: Each element in `sequence_bias` has to be a non-empty list of lists of positive integers and float, but is [[[], 1.0]].

A focused regression test is added in tests/generation/test_logits_process.py for bad_words_ids=[[]], mixed empty and non-empty bad_words_ids, and sequence_bias=[[[], 1.0]]. It fails on main because no ValueError is raised at construction time, and passes with this change.

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline and the Pull Request checks?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes according to the guidelines?
Did you write any new necessary tests?

Who can review?

@Cyrilvallez

…_bias

NoBadWordsLogitsProcessor and the list form of SequenceBiasLogitsProcessor accept an empty inner token list and only fail later, during generation, with an opaque IndexError instead of a clear validation error at construction time. The empty list is reachable from the documented usage, since some words tokenize to an empty list (e.g. tokenizer("", add_special_tokens=False).input_ids == []).

The dict form of sequence_bias already rejects empty sequences with a clear ValueError; this adds the same check to the two paths that missed it (the vacuous all/any over an empty list let it slip through), so an empty token list now raises a clear ValueError at construction.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
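The "vacuous all/any" remark can be shown with plain Python. The two check functions below are hypothetical and only mimic the general shape of a per-token validation; they are not copied from logits_process.py.

```python
def all_tokens_valid(token_ids):
    # Per-token check only: with no tokens to inspect, all() is True.
    return all(isinstance(t, int) and t >= 0 for t in token_ids)


def any_token_invalid(token_ids):
    # Per-token check only: with no tokens to inspect, any() is False.
    return any(not isinstance(t, int) or t < 0 for t in token_ids)


def all_tokens_valid_strict(token_ids):
    # Adding an explicit length check closes the gap.
    return len(token_ids) > 0 and all_tokens_valid(token_ids)


print(all_tokens_valid([]))         # True, so the empty list passes
print(any_token_invalid([]))        # False, so the empty list passes
print(all_tokens_valid_strict([]))  # False, so the empty list is rejected
```

Both unguarded forms accept an empty list, so validation written either way needs a separate emptiness check.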

Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:E-002-nobadwords-empty
