Add tiny_model_id support to ProcessorTesterMixin for memory-sensitive tests by ydshieh · Pull Request #47005 · huggingface/transformers

ydshieh · 2026-07-01T15:41:01Z

Summary

Adds a tiny_model_id attribute to ProcessorTesterMixin, allowing test classes to specify a lightweight Hub repo (e.g. containing only a tiny tokenizer) for use in most tests, while keeping model_id as the full Hub repo, used only when explicitly requested via get_processor(use_full=True) / get_component(use_full=True).
When tiny_model_id is set, setUpClass loads the tiny processor into tmpdirname (used by all tests) and the full processor into full_tmpdirname (used only on demand).
Components missing from the tiny repo fall back to model_id during setup.
Applies this to Fuyu, whose tokenizer has ~262K vocab tokens, making its tests slow and memory-intensive.

Memory benchmark (Fuyu tests)

Before (main):

   Delta MB     End MB  Worker  Test
  ---------  ---------  ------  ------------------------------------------------------------
  +   347.7     2045.3    main  FuyuProcessingTest::test_args_overlap_kwargs
  +   329.8     2441.9    main  FuyuProcessingTest::test_processor_from_and_save_pretrained_as_nested_dict
  +    91.4     1682.1    main  FuyuProcessingTest::test_apply_chat_template_assistant_mask
  +    51.3     1677.8    main  FuyuProcessingTest::test_fuyu_processing
  +    49.8     1678.5    main  FuyuProcessingTest::test_fuyu_processing
  +    34.8     2112.1    main  FuyuProcessingTest::test_processor_from_and_save_pretrained
  +    17.4     1700.9    main  FuyuProcessingTest::test_unstructured_kwargs_batched
  +    12.5     1691.0    main  FuyuProcessingTest::test_fuyu_processing_multiple_image_sample
  ...

  Worker   Tests   Peak RSS MB   Net growth MB
    main     195       2451.4   +       833.6

  28 passed, 37 skipped in 103.82s (0:01:43)

After (this PR):

   Delta MB     End MB  Worker  Test
  ---------  ---------  ------  ------------------------------------------------------------
  +    51.3     1677.8    main  FuyuProcessingTest::test_fuyu_processing
  +    49.8     1678.5    main  FuyuProcessingTest::test_fuyu_processing
  +    17.4     1700.9    main  FuyuProcessingTest::test_unstructured_kwargs_batched
  +    12.5     1691.0    main  FuyuProcessingTest::test_fuyu_processing_multiple_image_sample
  +    12.1     1703.6    main  FuyuProcessingTest::test_unstructured_kwargs_batched
  +     9.8     1628.7    main  FuyuProcessingTest::test_flat_kwarg_applied_when_modality_dict_lacks_it
  +     9.2     1626.5    main  FuyuProcessingTest::test_flat_kwarg_applied_when_modality_dict_lacks_it
  +     5.2     1683.0    main  FuyuProcessingTest::test_fuyu_processing_multiple_image_sample
  ...

  Worker   Tests   Peak RSS MB   Net growth MB
    main     130       1704.0   +        83.5

  28 passed, 37 skipped in 14.45s

| Metric | `main` | This PR | Improvement |
| --- | --- | --- | --- |
| Peak RSS | 2451.4 MB | 1704.0 MB | -747 MB |
| Net growth | +833.6 MB | +83.5 MB | 10x reduction |
| Runtime | 103.82s | 14.45s | 7x faster |
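
For reference, the rounded figures in the Improvement column can be reproduced from the raw benchmark numbers above. The snippet below is only a quick arithmetic check; the variable names are illustrative.

```python
# Raw numbers copied from the two benchmark runs above.
before = {"peak_rss_mb": 2451.4, "net_growth_mb": 833.6, "runtime_s": 103.82}
after = {"peak_rss_mb": 1704.0, "net_growth_mb": 83.5, "runtime_s": 14.45}

peak_saved_mb = before["peak_rss_mb"] - after["peak_rss_mb"]
growth_ratio = before["net_growth_mb"] / after["net_growth_mb"]
speedup = before["runtime_s"] / after["runtime_s"]

print(f"Peak RSS saved: {peak_saved_mb:.1f} MB")        # 747.4 MB
print(f"Net growth reduction: {growth_ratio:.1f}x")     # 10.0x
print(f"Runtime speedup: {speedup:.1f}x")               # 7.2x
```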

Test plan

Run pytest tests/models/fuyu/test_processing_fuyu.py and verify all existing tests pass
Verify memory usage is reduced compared to main using the memory tracker plugin

…e tests

Fuyu's tokenizer has ~262K vocab tokens, making its processor tests slow and memory-intensive. This introduces a tiny_model_id attribute on ProcessorTesterMixin so test classes can point to a lightweight Hub repo (e.g. only a tiny tokenizer) used by default, while the full processor is loaded on demand via use_full=True.

On Fuyu tests: 10x reduction in net memory growth (833 MB → 83 MB), 7x speedup (103s → 14s).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HuggingFaceDocBuilderDev · 2026-07-01T16:00:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2026-07-02T09:56:00Z

-    def get_processor(self):
-        processor = self.processor_class.from_pretrained(self.tmpdirname)
-        return processor
+    def get_processor(self, use_full=False):


nit: re naming, could we use something a bit more descriptive, like use_tiny_ckpt or something along those lines?

OK, I can change the argument name.

zucchini-nlp · 2026-07-02T10:00:01Z

+                        # tiny repo may be missing some components; fall back to cls.model_id
+                        try:
+                            custom_components[attribute] = component_class.from_pretrained(model_id)
+                        except Exception:
+                            custom_components[attribute] = component_class.from_pretrained(cls.model_id)


when does this happen?

The new repo I created

https://huggingface.co/hf-internal-testing/fuyu-tiny-tokenizer/tree/main

only has the tokenizer files and config.json, not the image processor files.

I am focusing on creating a few tiny tokenizers. But if you prefer, I could always push everything to the newly created repo so we don't need this try/except branch.

Yeah, I think we can host tiny-model-id-processor repos in a similar fashion to weights-only test repos. It feels weird having to load half of the files from different repos.

zucchini-nlp · 2026-07-02T10:00:42Z

+        if model_id == cls.tiny_model_id:
+            # tiny repo may be incomplete; all components were individually loaded above with fallback
+            # to cls.model_id. Construct directly since from_pretrained would fail on missing files.
+            processor = cls.processor_class(**custom_components, **kwargs)


this is also interesting, if components are missing, shouldn't we call from_pretrained to download missing components?

from_pretrained will fail in this case if the repo itself doesn't have the full list of files needed to load all components. For example,

https://huggingface.co/hf-internal-testing/fuyu-tiny-tokenizer/tree/main

only has the tiny tokenizer.

As in #47005 (comment), I can always push everything to the newly created tiny repo, but that distracts from the goal of "creating a tiny tokenizer", which is all we need.

I reverted the change. The new repo should contain the complete list of processor files (but in tiny versions, like the tokenizer).

github-actions · 2026-07-03T10:02:21Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: fuyu

github-actions · 2026-07-03T11:32:53Z

CI recap

Dashboard: View test results in Grafana
Latest run: 28653280321:1
Result: success | Jobs: 14 | Tests: 32,621 | Failures: 62 | Duration: 2h 46m

zucchini-nlp

Much cleaner and easier to read, thanks! Can we add a small test to check how heavy the processors are, same way as we have test_is_model_small?

Not now of course, but with that we can nudge users to create a testing processor if memory usage goes beyond the expected tolerance.

ydshieh · 2026-07-03T12:23:09Z

It's a good idea, but maybe after I upload the script used to create the tiny tokenizer and upload more tiny tokenizers.
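
One possible shape for such a check, as a rough sketch only: the helper names peak_alloc_mb and assert_processor_is_small and the 50 MB default tolerance are hypothetical, not part of transformers. It uses the standard-library tracemalloc module, which only sees allocations made through Python's allocators, so it would under-report memory held by native code and is not comparable to the RSS numbers in the benchmark above.

```python
import tracemalloc


def peak_alloc_mb(load_fn):
    """Return the peak Python-level allocation (in MB) observed while calling load_fn."""
    tracemalloc.start()
    try:
        obj = load_fn()  # keep a reference so the result is still alive when measured
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


def assert_processor_is_small(load_fn, tolerance_mb=50.0):
    """Raise AssertionError if loading allocates more than tolerance_mb (hypothetical check)."""
    peak = peak_alloc_mb(load_fn)
    if peak > tolerance_mb:
        raise AssertionError(
            f"Loading allocated {peak:.1f} MB (> {tolerance_mb} MB); "
            "consider pointing the test at a tiny processor repo."
        )
    return peak
```

In a real test, load_fn would be something like a call that loads the processor under test; here it is left as an arbitrary callable so the sketch stays independent of any library.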

ydshieh force-pushed the tiny_model_id_for_processing_tests branch from 174d02d to 6cc4106 Compare July 1, 2026 15:42

style

58e575e

ydshieh requested review from molbap and zucchini-nlp July 1, 2026 18:12

zucchini-nlp reviewed Jul 2, 2026


ydshieh added 2 commits July 3, 2026 11:57

new repo

ddd220c

update

73268f6

ydshieh requested a review from zucchini-nlp July 3, 2026 11:38

zucchini-nlp reviewed Jul 3, 2026


zucchini-nlp approved these changes Jul 3, 2026

