Some problems when using RICE-ViT

Hello @anxiangsir 🤗

I'm Yiming, a student at NTU. Recently I’ve been working with RICE-ViT and trying to reproduce the [baseline](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/Yy09pOusaZ47LofJ27xox.jpeg) built on Qwen2.5-7B-Instruct. I ran into a couple of questions and would really appreciate your help:

### About reproducing ViT-L-14-336px results
I used [rice-vit-large-patch14-560](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560) and modified `crop_size` and `shortest_edge` in [preprocessor_config](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560/blob/main/preprocessor_config.json) to 336, attempting to match the ViT-L-14-336px setup. Is this the correct way to reproduce the 336px version? If not, where can I find the checkpoint specifically trained for ViT-L-14-336px?
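
For reference, the change I made amounts to roughly the sketch below. It works on a plain dict, and the key layout (`size.shortest_edge`, `crop_size.height` / `crop_size.width`) is my assumption about how the preprocessor config is structured, so please check it against the actual file. The helper names `override_resolution` and `num_patches` are only illustrative and are not part of any library.

```python
import copy

PATCH_SIZE = 14  # taken from "patch14" in the checkpoint name


def override_resolution(config: dict, resolution: int) -> dict:
    """Return a copy of a preprocessor config with resize and crop set to `resolution`."""
    if resolution % PATCH_SIZE != 0:
        raise ValueError(f"{resolution} is not a multiple of the patch size {PATCH_SIZE}")
    new_config = copy.deepcopy(config)  # leave the original config untouched
    new_config["size"] = {"shortest_edge": resolution}
    new_config["crop_size"] = {"height": resolution, "width": resolution}
    return new_config


def num_patches(resolution: int) -> int:
    """Number of patch tokens for a square input of side `resolution`."""
    side = resolution // PATCH_SIZE
    return side * side


# Assumed layout of the relevant keys in preprocessor_config.json (not copied from the repo).
original = {
    "size": {"shortest_edge": 560},
    "crop_size": {"height": 560, "width": 560},
}

modified = override_resolution(original, 336)
print(modified)
print(num_patches(560), num_patches(336))
```

With this change the encoder sees a 24 x 24 patch grid instead of 40 x 40, which is why I am unsure whether the 560px weights can stand in for a checkpoint trained at 336px.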

### Which MLCDVisionModel to use
I noticed that there are two versions of MLCDVisionModel:

- one in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/mlcd/vit_rope2d_hf.py#L275),

- another in [transformers](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mlcd/modeling_mlcd.py#L53)

For RICE-ViT, I used the version from Transformers.
Is this the correct choice?

Thanks a lot for your time! 🙏
