[Fix] conversation_to_ids: truncation guard crashes with "'list' object has no attribute 'shape'" (#581) by yushuosun · Pull Request #1119 · OpenBMB/MiniCPM-V

yushuosun · 2026-06-28T14:56:16Z

Motivation

Finetuning crashes on any sample longer than max_length with AttributeError: 'list' object has no attribute 'shape', reported in #581 (and #578). The bug is still present on current main (8a2db68).

Root cause

In finetune/dataset.py, conversation_to_ids() hstacks the per-segment arrays into the tensor ids:

ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))

but the over-length guard still inspects input_ids, which is the raw Python list returned by conversation_to_ids_minicpm/llama3/qwen2 and has no .shape:

if input_ids.shape[-1] > max_length:                                   # line 147
    ids = ids[:max_length]
    context = context[:max_length]
    logger.warning(f"The input length ({input_ids.shape[-1]}) ...")     # line 150

So instead of truncating and warning, the run aborts the moment a tokenized sample exceeds max_length (default 2048).
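
The failure mode described above can be reproduced in isolation. This is a minimal sketch rather than the repository's code: the segment values are made up, NumPy stands in for the torch tensor, and `.astype` replaces the `dtype=` argument of `np.hstack` so the snippet also runs on older NumPy releases.

```python
import numpy as np

# Hypothetical stand-in for the per-segment arrays returned by conversation_to_ids_*.
input_ids = [np.array([1, 2, 3]), np.array([4, 5])]

try:
    input_ids.shape  # a plain Python list has no .shape attribute
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Stacking the segments yields a single 1-D array, which does expose .shape.
ids = np.hstack(input_ids).astype(np.int32)
stacked_length = ids.shape[-1]

print(error_message)
print(stacked_length)
```

The list raises the same AttributeError quoted in the report, while the stacked array exposes the length the guard needs.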

Modifications

finetune/dataset.py: use the ids tensor in both the guard (line 147) and the warning message (line 150). Two-token change; the only behavioral effect is that the intended truncation path now works.

Duplicate-check

Track A — open PRs referencing the issue:
```
gh api "search/issues?q=repo:OpenBMB/MiniCPM-V+is:pr+581+in:body"  ->  no open PR
```
Other open PRs touching finetune/ do not touch this block: #1088 "fix(docs): correct LLM_TYPE for MiniCPM-V-4 from 'llama' to 'llama3'" (docs), #1010 "Update trainer.py" (trainer.py), #281 "feat: Added judgment logic to support training with plain text data." (__len__/data_collator).
Prior PR #579 "fix finetune minicpm error" proposed a partial fix (line 147 only) but was closed unmerged by its author; line 150 still carries the identical bug.
Track B — issue thread has no open/in-progress claim.

conversation_to_ids() builds the token tensor as `ids` via np.hstack(input_ids), but the over-length truncation guard still reads `input_ids.shape[-1]`. `input_ids` is the raw Python list returned by conversation_to_ids_* and has no `.shape`, so any sample longer than max_length raises `AttributeError: 'list' object has no attribute 'shape'` and aborts finetuning instead of truncating. Use the `ids` tensor in both the guard and the warning.

Copilot

Pull request overview

Fixes a finetuning crash in conversation_to_ids() when tokenized samples exceed max_length by ensuring the truncation guard operates on the stacked tensor rather than the original Python list of segments.

Changes:

Switches the over-length check from input_ids.shape (invalid for lists) to ids.shape.
Updates the warning message to reference ids instead of input_ids.


-    if input_ids.shape[-1] > max_length:
+    if ids.shape[-1] > max_length:
        ids =ids[:max_length]
        context = context[:max_length]
-        logger.warning(f"The input length ({input_ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated")
+        logger.warning(f"The input length ({ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated")
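
One detail worth checking in the patched block: the warning reads `ids.shape[-1]` after `ids` has already been sliced, so as the diff is shown it would appear to report `max_length` rather than the original sample length. A possible variant, sketched under assumptions, records the length before truncating. The helper name `truncate_to_max_length` is hypothetical and not part of the repository, and NumPy arrays stand in for the torch tensors (1-D slicing behaves the same way for both).

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def truncate_to_max_length(ids, context, max_length):
    """Truncate ids/context to max_length, warning with the original length.

    Hypothetical helper for illustration; not a function in finetune/dataset.py.
    """
    original_length = ids.shape[-1]  # capture before slicing
    if original_length > max_length:
        ids = ids[:max_length]
        context = context[:max_length]
        logger.warning(
            f"The input length ({original_length}) exceeds the model's "
            f"maximum length ({max_length}), so it has been truncated"
        )
    return ids, context


# Usage with made-up data: a 10-token sample and a limit of 4.
ids = np.arange(10, dtype=np.int32)
context = np.ones(10, dtype=np.int8)
short_ids, short_context = truncate_to_max_length(ids, context, max_length=4)
```

With this ordering the log line reports 10 and 4 for the sample above, instead of repeating the limit twice.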


tc-mb · 2026-06-29T08:11:49Z

Hi @yushuosun, thanks for the PR.

Just a heads-up — the finetune code in the main MiniCPM-V repo is no longer actively maintained. The official finetune scripts have been moved to the new MiniCPM-V Cookbook repo:

Repo: https://github.com/OpenSQZ/MiniCPM-V-Cookbook
Corresponding file: finetune/official/dataset.py
CookBook docs: https://opensqz.github.io/MiniCPM-V-CookBook/site/en/shared/finetune/official.html

The conversation_to_ids function in the Cookbook needs the same verification and fix — you can confirm over there.

Two options:

You submit the PR to the Cookbook repo: patch finetune/official/dataset.py in OpenSQZ/MiniCPM-V-Cookbook. The PR and commits remain entirely your contribution.
We submit it on our end: if we file it, we can only add a note in the commit message crediting you, which may not fully reflect your contribution.

Which would you prefer?

Copilot AI review requested due to automatic review settings June 28, 2026 14:56

Copilot started reviewing on behalf of yushuosun June 28, 2026 14:56

Copilot AI reviewed Jun 28, 2026

yushuosun wants to merge 1 commit into OpenBMB:main from yushuosun:fix/finetune-dataset-truncation-attrerror
