Validate per-host infeed batch scaling #425

Open
fallintoplace wants to merge 1 commit into tensorflow:master from fallintoplace:fix-infeed-batch-divisibility

@fallintoplace

What changed

scale_global_to_infeed() now uses divmod() when per-host TPU infeed scaling is active and raises if the global batch size cannot be evenly divided by the TPU host count.

This matches the stricter behavior already used by scale_global_to_worker() and avoids silently truncating global_batch_size. For example, floor division turns 10 examples over 4 hosts into 2 per host, or 8 effective examples.
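
The divisibility check described above can be sketched as follows. This is a minimal illustration, not the code in lingvo/core/batch_utils.py: the function name `scale_to_per_host`, its parameters, and the choice of `ValueError` are assumptions made for the example, since the description only says the function raises.

```python
def scale_to_per_host(global_batch_size: int, num_hosts: int) -> int:
    """Hypothetical stand-in for the per-host scaling step (not the lingvo code).

    Returns the per-host batch size, or raises if the global batch cannot be
    split evenly across hosts.
    """
    if num_hosts <= 0:
        raise ValueError(f"num_hosts must be positive, got {num_hosts}")

    # divmod returns the quotient and the remainder in one call, so the
    # remainder that floor division would discard can be checked explicitly.
    per_host, remainder = divmod(global_batch_size, num_hosts)
    if remainder:
        # Assumption: ValueError is used here only for illustration.
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"num_hosts={num_hosts} (remainder {remainder})"
        )
    return per_host


if __name__ == "__main__":
    print(scale_to_per_host(8, 4))  # evenly divisible: 2 per host

    # Floor division alone would give 10 // 4 == 2 per host, i.e. 8 effective
    # examples; the check surfaces the 2 dropped examples as an error instead.
    try:
        scale_to_per_host(10, 4)
    except ValueError as err:
        print(f"rejected: {err}")
```

Raising at configuration time makes the mismatch visible before training starts, rather than leaving the effective batch size smaller than the configured one.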

History checked

scale_global_to_infeed() was introduced in 1594137c6 with floor division. The stricter scale_global_to_worker() precedent was added later in 8ce7aa894.

Tests

  • python3 -m py_compile lingvo/core/batch_utils.py lingvo/core/batch_utils_test.py
  • git diff --check -- lingvo/core/batch_utils.py lingvo/core/batch_utils_test.py
  • Local isolated harness for scale_global_to_infeed() divisible and non-divisible cases
  • USE_BAZEL_VERSION=5.3.0 npx -y @bazel/bazelisk test --experimental_repo_remote_exec //lingvo/core:batch_utils_test was attempted but does not reach the analysis phase in this checkout, because @rules_cc//cc:cc_library.bzl is not declared or resolved by the workspace.
