Skip to content

Handle sparse BOW token ids#426

Open
fallintoplace wants to merge 1 commit into
tensorflow:masterfrom
fallintoplace:fix-bow-token-id-sparse-vocab
Open

Handle sparse BOW token ids#426
fallintoplace wants to merge 1 commit into
tensorflow:masterfrom
fallintoplace:fix-bow-token-id-sparse-vocab

Conversation

@fallintoplace

Copy link
Copy Markdown

What changed

Vocab::GetBowTokenIds() now sizes its result by the largest non-negative token ID rather than by the number of tokens. That keeps sparse explicit vocab IDs from indexing past the end of the returned vector.

Vocab::Load() now tracks the largest non-negative ID while loading. It preserves existing support for negative non-BOW token IDs, but rejects negative IDs on BOW-prefixed tokens because those cannot be represented in a vector indexed by token ID.

A new C++ regression test covers a sparse BOW token ID at 100, a negative non-BOW ID, and rejection of a negative BOW token ID.

History checked

GetBowTokenIds() was added in 9d15517c8 using id_to_token_.size() as the vector size. That assumes dense IDs, while load_token_ids_from_vocab=True supports explicit IDs.

Tests

  • python3 -m py_compile lingvo/core/ops/simple_vocab_test.py
  • git diff --check -- lingvo/core/ops/simple_vocab.cc lingvo/core/ops/simple_vocab.h lingvo/core/ops/simple_vocab_cc_test.cc lingvo/core/ops/BUILD
  • USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core/ops:simple_vocab_cc_test does not reach analysis in this checkout because @rules_cc//cc:cc_library.bzl is not declared/resolved by the workspace.
  • USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core/ops:simple_vocab_test fails at the same workspace loading step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant