Handle sparse BOW token ids by fallintoplace · Pull Request #426 · tensorflow/lingvo

fallintoplace · 2026-06-20T22:52:04Z

What changed

Vocab::GetBowTokenIds() now sizes its result by the largest non-negative token ID rather than by the number of tokens. That keeps sparse explicit vocab IDs from indexing past the end of the returned vector.

Vocab::Load() now tracks the largest non-negative ID while loading. It preserves existing support for negative non-BOW token IDs, but rejects negative IDs on BOW-prefixed tokens because those cannot be represented in a vector indexed by token ID.

A new C++ regression test covers a sparse BOW token ID at 100, a negative non-BOW ID, and rejection of a negative BOW token ID.

History checked

GetBowTokenIds() was added in 9d15517c8 using id_to_token_.size() as the vector size. That assumes dense IDs, while load_token_ids_from_vocab=True supports explicit IDs.

Tests

python3 -m py_compile lingvo/core/ops/simple_vocab_test.py
git diff --check -- lingvo/core/ops/simple_vocab.cc lingvo/core/ops/simple_vocab.h lingvo/core/ops/simple_vocab_cc_test.cc lingvo/core/ops/BUILD
USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core/ops:simple_vocab_cc_test does not reach analysis in this checkout because @rules_cc//cc:cc_library.bzl is not declared/resolved by the workspace.
USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core/ops:simple_vocab_test fails at the same workspace loading step.

Handle sparse BOW token ids

a4e77e6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle sparse BOW token ids#426

Handle sparse BOW token ids#426
fallintoplace wants to merge 1 commit into
tensorflow:masterfrom
fallintoplace:fix-bow-token-id-sparse-vocab

fallintoplace commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fallintoplace commented Jun 20, 2026

What changed

History checked

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant