Add tiny_model_id support to ProcessorTesterMixin for memory-sensitive tests#47005
ydshieh wants to merge 4 commits
…e tests

Fuyu's tokenizer has ~262K vocab tokens, making its processor tests slow and memory-intensive. This introduces a tiny_model_id attribute on ProcessorTesterMixin so test classes can point to a lightweight Hub repo (e.g. only a tiny tokenizer) used by default, while the full processor is loaded on demand via use_full=True.

On Fuyu tests: 10x reduction in net memory growth (833 MB → 83 MB), 7x speedup (103s → 14s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
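As a quick sanity check on the rounded factors quoted in the commit message, the ratios can be recomputed from the four figures given there. This is only arithmetic on the quoted numbers, not a new measurement.

```python
# Figures quoted in the commit message above (not re-measured here).
mem_before_mb, mem_after_mb = 833, 83
time_before_s, time_after_s = 103, 14

mem_factor = mem_before_mb / mem_after_mb
time_factor = time_before_s / time_after_s

print(f"memory: {mem_factor:.1f}x, time: {time_factor:.1f}x")
# prints "memory: 10.0x, time: 7.4x"
```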
174d02d to 6cc4106
def get_processor(self):
    processor = self.processor_class.from_pretrained(self.tmpdirname)
    return processor
def get_processor(self, use_full=False):
nit: re naming, could we use a more descriptive one, like use_tiny_ckpt or something along those lines?
OK, I can change the argument name.
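For readers following the naming discussion, here is a minimal sketch of what the flag selects between. It assumes the method simply picks one of two directories; FakeProcessor, ProcessorTesterSketch and the directory strings are hypothetical stand-ins, and the real method body is not shown in this thread.

```python
class FakeProcessor:
    """Hypothetical stand-in for a real processor class."""

    def __init__(self, source):
        self.source = source

    @classmethod
    def from_pretrained(cls, path):
        # A real processor would read files from `path`; here we only record it.
        return cls(source=path)


class ProcessorTesterSketch:
    processor_class = FakeProcessor
    tmpdirname = "tiny-dir"       # placeholder: where the tiny processor was saved
    full_tmpdirname = "full-dir"  # placeholder: where the full processor was saved

    def get_processor(self, use_full=False):
        # Default to the lightweight processor; the full one is opt-in.
        path = self.full_tmpdirname if use_full else self.tmpdirname
        return self.processor_class.from_pretrained(path)
```

Whatever the flag ends up being called, the point of the review comment is that the call site should make it obvious which of the two checkpoints a test is using.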
# tiny repo may be missing some components; fall back to cls.model_id
try:
    custom_components[attribute] = component_class.from_pretrained(model_id)
except Exception:
    custom_components[attribute] = component_class.from_pretrained(cls.model_id)
The new repo I created,
https://huggingface.co/hf-internal-testing/fuyu-tiny-tokenizer/tree/main
only has the tokenizer files and config.json, not the image processor files.
I am focused on creating a few tiny tokenizers. But if you prefer, I could always push everything to the newly created repo, so we don't need this try/except branch.
yeah, I think we can host tiny-model-id-processor repos in a similar fashion to weights-only test repos. Feels weird having to load half of the files from different repos
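The fallback pattern being debated here can be sketched in isolation as below. The helper name load_with_fallback and the fake loader are hypothetical. The sketch also narrows the caught exception to OSError, on the assumption that a missing file surfaces that way, whereas the snippet under review catches Exception. Note that the thread later drops this approach in favor of a complete tiny repo.

```python
def load_with_fallback(loader, primary_id, fallback_id, missing_errors=(OSError,)):
    """Try `loader(primary_id)`; on a missing-file error, use `fallback_id` instead."""
    try:
        return loader(primary_id)
    except missing_errors:
        return loader(fallback_id)


def fake_image_processor_loader(repo_id):
    # Hypothetical loader: pretend the tiny repo has no image processor files.
    if repo_id == "tiny-repo":
        raise OSError(f"{repo_id} has no image processor config")
    return f"image-processor-from:{repo_id}"


component = load_with_fallback(fake_image_processor_loader, "tiny-repo", "full-repo")
print(component)  # prints "image-processor-from:full-repo"
```

The downside raised in the review is visible even in this toy version: the resulting processor is assembled from two different repos, which is harder to reason about than a single complete tiny repo.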
if model_id == cls.tiny_model_id:
    # tiny repo may be incomplete; all components were individually loaded above with fallback
    # to cls.model_id. Construct directly since from_pretrained would fail on missing files.
    processor = cls.processor_class(**custom_components, **kwargs)
this is also interesting, if components are missing, shouldn't we call from_pretrained to download missing components?
from_pretrained in this case will fail if the repo itself doesn't have the full list of files needed to load all components, as with
https://huggingface.co/hf-internal-testing/fuyu-tiny-tokenizer/tree/main
which only has the tiny tokenizer.
Similar to #47005 (comment), I can always push everything to the newly created tiny repo, but it distracts from the goal of "creating a tiny tokenizer", which is all we need.
I reverted the change. The new repo should contain the complete list of processor files (but in tiny versions, like the tokenizer).
[For maintainers] Suggested jobs to run (before merge): run-slow: fuyu
zucchini-nlp left a comment
Much cleaner and easier to read, thanks! Can we add a small test to check how heavy the processors are, same way as we have test_is_model_small?
Not now of course, but with that we can nudge users to create a testing processor if memory usage goes beyond the expected tolerance
It's a good idea, but maybe after I upload the script that is used to create the tiny tokenizer and upload more tiny tokenizers first.
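One possible shape for such a check, sketched with the standard-library tracemalloc module. The helper names and the tolerance values are hypothetical, and nothing here is part of the PR. An important caveat: tracemalloc only sees allocations made through Python's allocators, so memory held by native tokenizer backends would likely be undercounted, and a real test might prefer a process-level measurement.

```python
import tracemalloc


def peak_alloc_mb(build):
    """Call `build()` and return (object, peak traced Python allocation in MB)."""
    tracemalloc.start()
    try:
        obj = build()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return obj, peak_bytes / 1e6


def assert_processor_is_small(build, tol_mb):
    """Fail if constructing the object allocates more than `tol_mb` megabytes."""
    obj, peak_mb = peak_alloc_mb(build)
    assert peak_mb <= tol_mb, (
        f"building the processor allocated {peak_mb:.1f} MB (> {tol_mb} MB); "
        "consider pointing the test at a tiny testing repo"
    )
    return obj
```

In a test class, `build` could be something like `lambda: self.get_processor()`, with the tolerance chosen per model family; both are left open here.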
Summary
- Adds a tiny_model_id attribute to ProcessorTesterMixin, allowing test classes to specify a lightweight Hub repo (e.g. containing only a tiny tokenizer) for use in most tests, while keeping model_id as the full Hub repo used only when explicitly requested via get_processor(use_full=True) / get_component(use_full=True)
- When tiny_model_id is set, setUpClass loads the tiny processor into tmpdirname (used by all tests) and the full processor into full_tmpdirname (used only on demand)
- … model_id during setup

Memory benchmark (Fuyu tests)
Before (main):
After (this PR):
… main

Test plan
- Run pytest tests/models/fuyu/test_processing_fuyu.py and verify all existing tests pass
- … main using the memory tracker plugin
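The setup described in the Summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: FakeProcessor, SetupSketch and the repo ids "tiny-repo" and "full-repo" are hypothetical, and the fake class writes a single marker file instead of real processor files.

```python
import os
import shutil
import tempfile


class FakeProcessor:
    """Hypothetical stand-in that persists only a name."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def from_pretrained(cls, source):
        marker = os.path.join(source, "name.txt")
        if os.path.isfile(marker):
            with open(marker) as f:
                return cls(f.read())
        return cls(source)  # otherwise pretend `source` is a Hub repo id

    def save_pretrained(self, path):
        with open(os.path.join(path, "name.txt"), "w") as f:
            f.write(self.name)


class SetupSketch:
    processor_class = FakeProcessor
    model_id = "full-repo"       # placeholder id for the full Hub repo
    tiny_model_id = "tiny-repo"  # placeholder id; None would mean "no tiny repo"

    @classmethod
    def setUpClass(cls):
        # Every test reads from tmpdirname, which holds the tiny processor if one is set.
        cls.tmpdirname = tempfile.mkdtemp()
        default_id = cls.tiny_model_id or cls.model_id
        cls.processor_class.from_pretrained(default_id).save_pretrained(cls.tmpdirname)
        # The full processor is stored separately and used only on demand.
        cls.full_tmpdirname = None
        if cls.tiny_model_id is not None:
            cls.full_tmpdirname = tempfile.mkdtemp()
            cls.processor_class.from_pretrained(cls.model_id).save_pretrained(cls.full_tmpdirname)

    @classmethod
    def tearDownClass(cls):
        for path in (cls.tmpdirname, cls.full_tmpdirname):
            if path:
                shutil.rmtree(path, ignore_errors=True)
```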