[Fix] conversation_to_ids: truncation guard crashes with "'list' object has no attribute 'shape'" (#581) #1119

Open
yushuosun wants to merge 1 commit into OpenBMB:main from yushuosun:fix/finetune-dataset-truncation-attrerror

@yushuosun

Motivation

Finetuning crashes on any sample longer than max_length with AttributeError: 'list' object has no attribute 'shape', reported in #581 (and #578). The bug is still present on current main (8a2db68).

Root cause

In finetune/dataset.py, conversation_to_ids() hstacks the per-segment arrays into the tensor ids:

ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))

but the over-length guard still inspects input_ids, which is the raw Python list returned by conversation_to_ids_minicpm/llama3/qwen2 and has no .shape:

if input_ids.shape[-1] > max_length:                                   # line 147
    ids = ids[:max_length]
    context = context[:max_length]
    logger.warning(f"The input length ({input_ids.shape[-1]}) ...")     # line 150

So instead of truncating and warning, the run aborts the moment a tokenized sample exceeds max_length (default 2048).
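The failure can be reproduced outside the training loop. The following is a minimal sketch, assuming NumPy arrays in place of the torch tensor and using made-up segment values; `buggy_guard` and `fixed_guard` are hypothetical names that isolate the guard and are not functions in finetune/dataset.py.

```python
import numpy as np

# Hypothetical per-segment token arrays, standing in for the raw Python list
# returned by conversation_to_ids_minicpm/llama3/qwen2.
input_ids = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])]

# Stack the segments into one flat array, as conversation_to_ids() does
# (the torch.from_numpy step is omitted to keep the sketch NumPy-only).
ids = np.hstack(input_ids).astype(np.int32)

max_length = 5  # small value chosen so the sample is over-length


def buggy_guard(input_ids, ids, max_length):
    # Reads .shape on the list of segments, which a Python list does not have.
    if input_ids.shape[-1] > max_length:
        ids = ids[:max_length]
    return ids


def fixed_guard(ids, max_length):
    # Reads .shape on the stacked array instead.
    if ids.shape[-1] > max_length:
        ids = ids[:max_length]
    return ids


error_message = None
try:
    buggy_guard(input_ids, ids, max_length)
except AttributeError as exc:
    error_message = str(exc)

truncated = fixed_guard(ids, max_length)
print(error_message)
print(truncated.tolist())
```

In this sketch the list-based guard raises the reported AttributeError, while the array-based guard returns the first max_length tokens.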

Modifications

finetune/dataset.py: use the ids tensor in both the guard (line 147) and the warning message (line 150). Two-token change; the only behavioral effect is that the intended truncation path now works.
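The repaired truncation path can be sketched as a standalone helper. This is an illustration under assumptions, not the code in finetune/dataset.py: `truncate_pair` and the logger name are hypothetical, and NumPy arrays stand in for the torch tensors. One detail worth checking against the patch: because the warning is emitted after `ids` is sliced, `ids.shape[-1]` at that point equals max_length, so the sketch records the length before slicing in order to report the original value.

```python
import logging

import numpy as np

logger = logging.getLogger("truncation_sketch")  # hypothetical logger name


def truncate_pair(ids, context, max_length):
    """Hypothetical helper: truncate ids and context to max_length together."""
    # Capture the length before slicing so the warning shows the original size.
    original_length = ids.shape[-1]
    if original_length > max_length:
        ids = ids[:max_length]
        context = context[:max_length]
        logger.warning(
            f"The input length ({original_length}) exceeds the model's "
            f"maximum length ({max_length}), so it has been truncated"
        )
    return ids, context


# Made-up data: 10 tokens with a matching context mask.
ids = np.arange(10, dtype=np.int32)
context = np.ones(10, dtype=np.int8)

short_ids, short_context = truncate_pair(ids, context, max_length=4)
same_ids, same_context = truncate_pair(ids, context, max_length=2048)
```

Samples at or under max_length pass through unchanged, and over-length samples are cut to exactly max_length entries in both arrays.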

Duplicate-check

conversation_to_ids() builds the token tensor as `ids` via np.hstack(input_ids),
but the over-length truncation guard still reads `input_ids.shape[-1]`.
`input_ids` is the raw Python list returned by conversation_to_ids_* and has no
`.shape`, so any sample longer than max_length raises
`AttributeError: 'list' object has no attribute 'shape'` and aborts finetuning
instead of truncating. Use the `ids` tensor in both the guard and the warning.
Copilot AI review requested due to automatic review settings June 28, 2026 14:56

Copilot AI left a comment

Pull request overview

Fixes a finetuning crash in conversation_to_ids() when tokenized samples exceed max_length by ensuring the truncation guard operates on the stacked tensor rather than the original Python list of segments.

Changes:

  • Switches the over-length check from input_ids.shape (invalid for lists) to ids.shape.
  • Updates the warning message to reference ids instead of input_ids.

Comment thread finetune/dataset.py
Comment on lines +147 to +150
+    if ids.shape[-1] > max_length:
         ids =ids[:max_length]
         context = context[:max_length]
-        logger.warning(f"The input length ({input_ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated")
+        logger.warning(f"The input length ({ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated")
@tc-mb

tc-mb commented Jun 29, 2026

Collaborator

Hi @yushuosun, thanks for the PR.

Just a heads-up — the finetune code in the main MiniCPM-V repo is no longer actively maintained. The official finetune scripts have been moved to the new MiniCPM-V Cookbook repo:

The conversation_to_ids function in the Cookbook needs the same verification and fix — you can confirm over there.

Two options:

  1. You submit the PR to the Cookbook repo: patch finetune/official/dataset.py in OpenSQZ/MiniCPM-V-Cookbook. The PR and commits remain entirely your contribution.
  2. We submit it on our end: if we file it, we can only add a note in the commit message crediting you, which may not fully reflect your contribution.

Which would you prefer?

3 participants