
evals: avoid calling torch.compile when not needed#618

Draft
dsocolobsky wants to merge 4 commits into main from dy/evals-torchtitan-no-compile

Conversation


@dsocolobsky dsocolobsky commented Mar 4, 2026

This improves performance: we can avoid calling torch.compile in evals, since it precompiles length-specific kernels that are never reused because every input has a different sequence length.

WIP: the performance improvements might not be real; still running tests.

Closes #588
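For context, the kind of change described above could look like the sketch below. This is not the actual PR diff; `maybe_compile` and `use_compile` are hypothetical names, and `dynamic=True` is a real `torch.compile` option for shape-polymorphic kernels that could be an alternative to skipping compilation entirely.

```python
# Hedged sketch (not the actual diff): skip torch.compile on the eval path.
# `maybe_compile` and `use_compile` are hypothetical names for illustration.

def maybe_compile(model, use_compile: bool, dynamic: bool = False):
    """Wrap `model` with torch.compile only when requested.

    Eval inputs vary in sequence length, so shape-specialized compilation
    recompiles per length; returning the eager model avoids that cost.
    `dynamic=True` instead asks torch.compile for shape-polymorphic kernels.
    """
    if not use_compile:
        return model  # eager mode: no per-length recompiles
    import torch  # lazy import: the eager path does not need torch here
    return torch.compile(model, dynamic=dynamic)
```

In an eval loop one would then call `maybe_compile(model, use_compile=False)` and keep compilation for training only.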

dsocolobsky and others added 3 commits March 4, 2026 17:06
This improves performance: we can avoid calling torch.compile in evals, since it precompiles length-specific kernels that are never reused because every input has a different sequence length.
@pefontana
Contributor

@dsocolobsky
Okay, so I think the benchmarks you shared are not really accurate.
Yes, the average time is slightly better on dy/evals-torchtitan-no-compile vs main, but I think that's due to the cache not being loaded yet (model or dataset tasks). Take a look at the range:

main: Range (min … max): 106.472 s … 159.700 s
dy/evals-torchtitan-no-compile: Range (min … max): 107.198 s … 111.477 s

So in the best case (when the model and tasks are cached), the evaluations seem to take the same time.
I ran benchmarks and the eval times don't seem to improve.

### main no python full eval
- time ./target/release/examples/evaluate --model NousResearch/Meta-Llama-3.1-8B --data-parallelism 8 --tasks arc_easy

ARC-Easy: {"acc_norm": 0.8055555555555556, "acc_uncond": 0.7394781144781145, "acc": 0.8127104377104377}

real	2m12.354s
user	14m4.919s
sys	0m52.347s


### main python full eval
- time ./target/release/examples/evaluate --model NousResearch/Meta-Llama-3.1-8B --python --python-arch Torchtitan --data-parallelism 8 --tasks arc_easy

I didn't finish it because it was taking so long, but I estimate ~30 min.

### main python limit 100
- time ./target/release/examples/evaluate --model NousResearch/Meta-Llama-3.1-8B --python --python-arch Torchtitan --data-parallelism 8 --tasks arc_easy --limit 100
real	2m10.221s
user	4m52.856s
sys	0m13.055s

### dy/evals-torchtitan-no-compile python limit 100
- time ./target/release/examples/evaluate --model NousResearch/Meta-Llama-3.1-8B --python --python-arch Torchtitan --data-parallelism 8 --tasks arc_easy --limit 100
real	2m13.175s

I would expect results similar to the no-python run.
Also, this is not the best use case for hyperfine; a plain time measurement is fine.
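If hyperfine is used anyway despite the long runtimes, a warmup run would keep the cold model/dataset cache out of the measured range. A hedged sketch, reusing the command from this thread (`--warmup` and `--runs` are real hyperfine flags):

```shell
# Discard one cold-cache run before measuring, then take 3 timed runs.
hyperfine --warmup 1 --runs 3 \
  './target/release/examples/evaluate \
     --model NousResearch/Meta-Llama-3.1-8B \
     --python --python-arch Torchtitan \
     --data-parallelism 8 \
     --tasks arc_easy --limit 100'
```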


@pefontana pefontana left a comment


check comment

@dsocolobsky
Contributor Author

Okay, yeah. I've been trying with larger models and more documents processed, and sometimes I seem to get a speedup but sometimes I don't, so I don't think that's enough to claim a performance improvement. Moreover, I added some logs, and apparently I do still hit compilation code sometimes.

Will leave the PR as a draft for a while to continue testing a few things.
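A toy model of why variable-length inputs keep hitting the compiler: shape-specialized compilation behaves like a cache keyed by sequence length, so nearly every padding-free eval batch is a miss. (To confirm real recompiles, PyTorch's `TORCH_LOGS=recompiles` environment variable logs them, which may be what the added logs showed.) The sketch below is pure Python, not PyTorch itself:

```python
# Toy model (not PyTorch internals) of shape-specialized compilation:
# a cache keyed by sequence length, where each unseen length "recompiles".

def count_recompiles(seq_lens):
    compiled = set()   # lengths we already have a specialized kernel for
    recompiles = 0
    for n in seq_lens:
        if n not in compiled:
            compiled.add(n)  # compile a kernel specialized to length n
            recompiles += 1
    return recompiles

# Mostly-unique eval lengths: nearly every batch triggers a recompile.
assert count_recompiles([128, 131, 256, 131, 301]) == 4
```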

Development

Successfully merging this pull request may close these issues.

Torchtitan evals are slow because torch.compile recompiles for every unique input sequence length.
