Add more useful trace and validation against position embeddings #302

Open
10LGUO wants to merge 2 commits into ByteDance-Seed:main from 10LGUO:fix/position-embedding-bounds-check

@10LGUO 10LGUO commented May 11, 2026

Today, when a dataset image exceeds the corresponding pixel limit, the data pipeline generates position IDs beyond the position embedding table size, which triggers a CUDA device-side assertion. The assertion kills the process with SIGABRT, giving no Python-level stack trace and no indication of which config value is wrong. That makes the failure hard for a coding agent (or a human) to debug on the first attempt.

The constraints are:
VAE path: max_image_size ≤ max_latent_size × vae_image_downsample
VIT path: vit_max_image_size ≤ vit_max_num_patch_per_side × vit_patch_size
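
For a concrete sense of the two limits, here is a minimal sketch of the arithmetic. The helper names and the numbers in the comments are illustrative assumptions, not values read from the repo's configs.

```python
def max_supported_image_size(table_side: int, pixels_per_cell: int) -> int:
    """Largest image side, in pixels, whose position IDs still fit the table.

    table_side      : max_latent_size (VAE) or vit_max_num_patch_per_side (VIT)
    pixels_per_cell : vae_image_downsample (VAE) or vit_patch_size (VIT)
    """
    return table_side * pixels_per_cell


def fits(image_size: int, table_side: int, pixels_per_cell: int) -> bool:
    """True when the constraint image_size <= table_side * pixels_per_cell holds."""
    return image_size <= max_supported_image_size(table_side, pixels_per_cell)


# Illustrative numbers only (assumed, not taken from a real config):
# VAE path: max_latent_size=64, vae_image_downsample=16 gives a 1024 px limit.
# VIT path: vit_max_num_patch_per_side=70, vit_patch_size=14 gives a 980 px limit.
```

With these example numbers, a dataset configured with `max_image_size=1024` is fine on the VAE path, while anything larger would index past the table.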

TRAIN.md already notes "if max_latent_size is not set correctly, an out-of-bounds error may occur", but there was no code-level guard.

This commit adds a preflight check on rank 0 immediately after the dataset config is loaded. It raises a descriptive ValueError before any GPU work starts, telling the user exactly which dataset and which values are in conflict and how to fix them.
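
The shape of such a preflight check can be sketched as follows. This is not the code in the PR: the function name, the dict-based config layout, and the error wording are assumptions made for illustration, and only the VAE path is shown (the VIT path would be analogous).

```python
def check_vae_position_bounds(dataset_configs, *, max_latent_size,
                              vae_image_downsample, rank=0):
    """Raise ValueError if any dataset's max_image_size exceeds the VAE limit.

    dataset_configs is assumed to map dataset name -> config dict that may
    contain image_transform_args["max_image_size"] (hypothetical layout).
    """
    if rank != 0:
        return  # run the check once, on rank 0 only

    limit = max_latent_size * vae_image_downsample
    problems = []
    for name, cfg in dataset_configs.items():
        args = cfg.get("image_transform_args") or {}
        size = args.get("max_image_size")
        if size is not None and size > limit:
            problems.append(
                f"dataset '{name}': max_image_size={size} exceeds "
                f"max_latent_size ({max_latent_size}) x vae_image_downsample "
                f"({vae_image_downsample}) = {limit}; lower max_image_size "
                f"or raise max_latent_size"
            )
    if problems:
        raise ValueError("Position embedding bounds check failed:\n"
                         + "\n".join(problems))
```

Collecting every violation before raising means one run reports all misconfigured datasets instead of only the first.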

Co-authored by Claude Code.

10LGUO added 2 commits May 11, 2026 09:13
…g bounds at startup

Today, when a dataset image exceeds the corresponding pixel limit, the
data pipeline generates position IDs beyond the position embedding table
size, which triggers a CUDA device-side assertion. The assertion kills
the process with SIGABRT, giving no Python-level stack trace and no
indication of which config value is wrong. That makes the failure hard
for a coding agent (or a human) to debug on the first attempt.

The constraints are:
  VAE path: max_image_size ≤ max_latent_size × vae_image_downsample
  VIT path: vit_max_image_size ≤ vit_max_num_patch_per_side × vit_patch_size

TRAIN.md already notes "if max_latent_size is not set correctly, an
out-of-bounds error may occur", but there was no code-level guard.

This commit adds a preflight check on rank 0 immediately after the
dataset config is loaded. It raises a descriptive ValueError before
any GPU work starts, telling the user exactly which dataset and which
values are in conflict and how to fix them.
Co-authored by: Claude Code.

Falling back to image_transform_args caused a false-positive ValueError on
datasets like t2i_pretrain that have no VIT path. Only check VIT bounds
when vit_image_transform_args is explicit or the dataset is a known VLM type.
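
The gating rule from the second commit can be sketched as a small predicate. This is an illustration under assumptions: the function name, the dict-based config layout, and the contents of the VLM type set are hypothetical placeholders, not the identifiers used in the PR.

```python
# Hypothetical placeholder: the real set of VLM dataset types lives in the repo.
KNOWN_VLM_DATASET_TYPES = frozenset({"example_vlm_type"})


def should_check_vit_bounds(dataset_type: str, cfg: dict) -> bool:
    """Decide whether the VIT bounds check applies to this dataset.

    Check only when the config sets vit_image_transform_args explicitly or
    the dataset type is a known VLM type. Do not infer a VIT path from
    image_transform_args alone, since datasets without a VIT path (such as
    t2i_pretrain) also define it.
    """
    has_explicit_vit_args = "vit_image_transform_args" in cfg
    return has_explicit_vit_args or dataset_type in KNOWN_VLM_DATASET_TYPES
```

Under this rule a `t2i_pretrain` entry that only defines `image_transform_args` is skipped, which avoids the false positive described above.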