Validate SimpleVocab special tokens before reading IDs #423

Open
fallintoplace wants to merge 1 commit into tensorflow:master from fallintoplace:fix-simple-vocab-special-token-validation

@fallintoplace

What changed

SimpleVocab now validates the selected special-token family before reading any IDs from token_to_id_. This avoids using operator[] during validation, which could insert a missing SOS token with ID 0 and let a malformed vocab load successfully.
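To illustrate the pitfall described above, here is a minimal sketch, not the code from this PR. It assumes token_to_id_ behaves like a std::unordered_map from token string to integer ID; the TokenMap alias, the helper names, and the lowercase token spellings are hypothetical and chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the vocab's token_to_id_ member.
using TokenMap = std::unordered_map<std::string, int32_t>;

// Unsafe presence check: operator[] value-initializes a missing key,
// so the token is inserted with ID 0 and the check always passes.
inline bool HasTokenUnsafe(TokenMap& m, const std::string& tok) {
  return m[tok] >= 0;
}

// Safe presence check: find() never modifies the map.
inline bool HasToken(const TokenMap& m, const std::string& tok) {
  return m.find(tok) != m.end();
}

// Small self-check showing the difference on a vocab with no SOS token.
inline void DemoOperatorBracketInsertion() {
  TokenMap m{{"</s>", 1}, {"<unk>", 2}};
  assert(!HasToken(m, "<s>"));       // Correctly reports the token as missing.
  assert(m.size() == 2);             // Map is untouched.
  assert(HasTokenUnsafe(m, "<s>"));  // Wrongly reports the token as present.
  assert(m.at("<s>") == 0);          // The missing token now has ID 0.
}
```

Because the unsafe check mutates the map, any later lookup sees a plausible-looking SOS entry, which is how a malformed vocab could appear valid.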

The loader now also rejects mixed lowercase/uppercase special-token families and duplicate IDs among the required special tokens. The duplicate check is intentionally scoped to SOS/EOS/UNK so existing vocab aliases outside those required tokens are not affected.
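The checks described above could look roughly like the following sketch. It is an illustration under stated assumptions rather than the loader's actual logic: the token spellings (<S>, </S>, <UNK> and their lowercase forms), the SpecialIds struct, and the ValidateSpecialTokens function are hypothetical, and treating any token from the other family as "mixed" is a guess at the intended rule.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using TokenMap = std::unordered_map<std::string, int32_t>;

// Hypothetical holder for the three required special-token IDs.
struct SpecialIds {
  int32_t sos;
  int32_t eos;
  int32_t unk;
};

// Returns true and fills *out only if exactly one family is fully present
// and its three IDs are distinct. Uses count()/at() so the map is never
// modified during validation.
inline bool ValidateSpecialTokens(const TokenMap& m, SpecialIds* out) {
  assert(out != nullptr);
  static const std::array<const char*, 3> kUpper = {"<S>", "</S>", "<UNK>"};
  static const std::array<const char*, 3> kLower = {"<s>", "</s>", "<unk>"};

  auto count_present = [&m](const std::array<const char*, 3>& family) {
    int n = 0;
    for (const char* tok : family) n += static_cast<int>(m.count(tok));
    return n;
  };
  const int upper = count_present(kUpper);
  const int lower = count_present(kLower);

  // Reject mixed families, then reject incomplete families.
  if (upper > 0 && lower > 0) return false;
  if (upper != 3 && lower != 3) return false;

  const std::array<const char*, 3>& family = (upper == 3) ? kUpper : kLower;
  const SpecialIds ids{m.at(family[0]), m.at(family[1]), m.at(family[2])};

  // Duplicate check is scoped to SOS/EOS/UNK only; other tokens may alias.
  if (ids.sos == ids.eos || ids.sos == ids.unk || ids.eos == ids.unk) {
    return false;
  }
  *out = ids;
  return true;
}
```

Only after this function succeeds would the IDs be read and stored, so a vocab that fails validation never gets partially applied.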

Tests

  • python3 -m py_compile lingvo/core/ops/simple_vocab_test.py
  • git diff --check -- lingvo/core/ops/simple_vocab.cc lingvo/core/ops/simple_vocab_test.py
  • USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core/ops:simple_vocab_test (not completed: the build did not reach the analysis phase in this checkout because @rules_cc//cc:cc_library.bzl is not declared or resolved by the workspace, so the test itself has not been run)
