[Fix] conversation_to_ids: truncation guard crashes with "'list' object has no attribute 'shape'" (#581)#1119
Open
yushuosun wants to merge 1 commit into
Conversation
conversation_to_ids() builds the token tensor as `ids` via np.hstack(input_ids), but the over-length truncation guard still reads `input_ids.shape[-1]`. `input_ids` is the raw Python list returned by conversation_to_ids_* and has no `.shape`, so any sample longer than max_length raises `AttributeError: 'list' object has no attribute 'shape'` and aborts finetuning instead of truncating. Use the `ids` tensor in both the guard and the warning.
There was a problem hiding this comment.
Pull request overview
Fixes a finetuning crash in conversation_to_ids() when tokenized samples exceed max_length by ensuring the truncation guard operates on the stacked tensor rather than the original Python list of segments.
Changes:
- Switches the over-length check from
input_ids.shape(invalid for lists) toids.shape. - Updates the warning message to reference
idsinstead ofinput_ids.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+147
to
+150
| if ids.shape[-1] > max_length: | ||
| ids =ids[:max_length] | ||
| context = context[:max_length] | ||
| logger.warning(f"The input length ({input_ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated") | ||
| logger.warning(f"The input length ({ids.shape[-1]}) exceeds the model's maximum length ({max_length}), so it has been truncated") |
Collaborator
|
Hi @yushuosun, thanks for the PR. Just a heads-up — the finetune code in the main MiniCPM-V repo is no longer actively maintained. The official finetune scripts have been moved to the new MiniCPM-V Cookbook repo:
The conversation_to_ids function in the Cookbook needs the same verification and fix — you can confirm over there. Two options:
Which would you prefer? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Finetuning crashes on any sample longer than
max_lengthwithAttributeError: 'list' object has no attribute 'shape', reported in #581 (and #578). The bug is still present on currentmain(8a2db68).Root cause
In
finetune/dataset.py,conversation_to_ids()hstacks the per-segment arrays into the tensorids:but the over-length guard still inspects
input_ids, which is the raw Python list returned byconversation_to_ids_minicpm/llama3/qwen2and has no.shape:So instead of truncating and warning, the run aborts the moment a tokenized sample exceeds
max_length(default 2048).Modifications
finetune/dataset.py: use theidstensor in both the guard (line 147) and the warning message (line 150). Two-token change; the only behavioral effect is that the intended truncation path now works.Duplicate-check
finetune/do not touch this block: fix(docs): correct LLM_TYPE for MiniCPM-V-4 from 'llama' to 'llama3' #1088 (docs), Update trainer.py #1010 (trainer.py), feat: Added judgment logic to support training with plain text data. #281 (__len__/data_collator).