
Add support for LoRA with Transformer Engine #3048

Open
balvisio wants to merge 5 commits into huggingface:main from balvisio:dev/ba/support-te-lora

Conversation

@balvisio

This PR adds support for recognizing TransformerEngine layers (https://github.com/NVIDIA/TransformerEngine) as valid target layers for LoRA adapters.

@BenjaminBossan
Member

Thanks for this PR to add support for TE in PEFT. Before proceeding further, do you have a practical example of how to use it with a transformers model? I assume one way would be for the user to employ accelerate to apply TE. Also, I saw that you intend to replace the nn.Linear layers used for lora_A and lora_B with TE's low FP precision layers. Did you test if that works well? Usually, we would keep these layers in fp32 (or fp16/bf16), even if the base layer uses lower precision (say, int4).
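For context on the precision point above: LoRA adds a low-rank update on top of the frozen base layer's output, and the lora_A/lora_B matrices are usually kept in higher precision than the base weights. A minimal NumPy sketch of that arithmetic (illustrative only, not PEFT's actual code; the names and shapes are chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 4, 2, 16

# Frozen base weight in low precision (stand-in for a quantized/TE layer).
W = rng.standard_normal((d_out, d_in)).astype(np.float16)

# LoRA adapters kept in float32: A is random, B starts at zero,
# so the adapted model initially matches the base model exactly.
A = rng.standard_normal((r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)

x = rng.standard_normal((3, d_in)).astype(np.float32)
base = x @ W.astype(np.float32).T
update = (alpha / r) * (x @ A.T) @ B.T
y = base + update
```

Because B is zero-initialized, `y` equals `base` at initialization; training then updates only A and B while W stays frozen in its low-precision format.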

@balvisio
Author

Thank you for taking a look at this. The primary goal of this PR is to support models that already contain TE layers. In fact, using te.Linear for the adapters is not strictly necessary; however, by default these TE layers will use the default torch dtype.
For the example, we have created and validated a "recipe" to fine-tune an ESM2 model using LoRA here: https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/esm2_peft_te

@BenjaminBossan
Member

The primary goal of this PR is to support models with TE layers already inside them.

I see, could you please provide a small example of how this looks in practice? We should eventually add a unit test with a real model architecture anyway, so this example would also help for that.

using te.Linear for the adapters is not strictly necessary.

I think for consistency, we should stick with float32 by default. We could think of an option to use 4bit and 8bit floats but I'd have to think about how the API for that should look.

For the example, we have created and validated a "recipe" to fine-tune an ESM2 model using LoRA here: https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/esm2_peft_te

I think it would be great to include such an example here. Maybe simplified if possible (using Trainer, no DDP, no extra logging). WDYT?

@balvisio
Author

We should eventually add a unit test with a real model architecture anyway, so this example would also help for that.
There is a unit test that creates a toy model that uses TE layers. Is that what you meant? Otherwise I can change the test to: AutoModelForTokenClassification.from_pretrained("nvidia/esm2_t6_8M_UR50D", config=config, trust_remote_code=True, dtype="bfloat16"). Let me know.

I think for consistency, we should stick with float32 by default.
Do you mean to remove using TE layers as adapters? I can do that, but just to make sure: currently the TE adapters will use the default torch dtype, not necessarily a low precision.

I think it would be great to include such an example here
Where should I add the example exactly? Is there a doc with examples?

@BenjaminBossan
Member

There is a unit test that creates a toy model that uses TE layers. Is that what you meant? Otherwise I can change the test to do

Yeah, we can use somewhat heavier models in test_gpu_examples.py. My goal would be to use a real model there, or at least a "tiny" version of a real model, instead of a toy model.

Do you mean to remove using TE layers as adapters?

Yes, so keep the base layers as they are and use normal nn.Linear for LoRA.
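To illustrate the direction suggested here with a hypothetical sketch (not the PR's actual code): the base layer is left untouched whatever its type, and the adapters are plain nn.Linear modules, which default to float32. It assumes the base layer exposes in_features/out_features, as both te.Linear and nn.Linear do:

```python
import torch
import torch.nn as nn

class LoraWrapper(nn.Module):
    """Hypothetical sketch: wrap a linear-like base layer (e.g. te.Linear)
    and add plain nn.Linear adapters, which default to float32."""

    def __init__(self, base_layer, r=8, alpha=16):
        super().__init__()
        self.base_layer = base_layer
        for p in self.base_layer.parameters():
            p.requires_grad = False  # freeze the base layer
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # standard LoRA init: B starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        out = self.base_layer(x)
        # Run the adapters in their own (float32) dtype, then cast back.
        update = self.lora_B(self.lora_A(x.to(self.lora_A.weight.dtype)))
        return out + self.scaling * update.to(out.dtype)

layer = LoraWrapper(nn.Linear(16, 16))
y = layer(torch.randn(2, 16))
```

With this shape, only lora_A.weight and lora_B.weight are trainable, and at initialization the wrapped layer's output matches the base layer's output exactly.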

I can do that but just to make sure, currently the TE adapters will use the default torch dtype, not necessarily a low precision.

Is there any advantage then?

Where should I add the example exactly? Is there a doc with examples?

Docs would be nice, but I was thinking of the examples/ folder.

balvisio force-pushed the dev/ba/support-te-lora branch from 0a19da8 to ffb0f7b on February 25, 2026 at 12:57
balvisio force-pushed the dev/ba/support-te-lora branch from ffb0f7b to d4984b1 on February 25, 2026 at 19:41
@balvisio
Author

@BenjaminBossan : I changed the tests to use a TE-based model and added an example.

@BenjaminBossan
Member
Thanks for updating the PR. I could successfully run the unit tests and overall the changes look good. However, I still have a few comments, please check. Most notably, we should avoid running anything with trust_remote_code=True by default.

@BenjaminBossan
Member

Thanks for the updates, not much is missing at this point.

Besides the comments I made, could you please also add an entry to the docs?

https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/quantization.md

It doesn't have to be long, but having it helps users discover the feature.

balvisio force-pushed the dev/ba/support-te-lora branch from 3b57840 to 22b4527 on February 27, 2026 at 20:07
@balvisio
Copy link
Author

@BenjaminBossan Thanks for looking at this. I have addressed your comments.
