[SYCL] Trim builtins header dependencies for compile time #21631

Draft
koparasy wants to merge 3 commits into intel:sycl from koparasy:sycl-builtins-frontend-compile-time-reduction

@koparasy koparasy commented Mar 25, 2026

Summary

Reduce frontend compile-time overhead in sycl/include/sycl/detail/builtins/builtins.hpp by narrowing its dependencies and excluding host-only helpers from device compilation.

This change:

Replaces the sycl/detail/vector_convert.hpp inclusion with the narrower headers actually needed here: sycl/detail/generic_type_traits.hpp and sycl/half_type.hpp.

Wraps the following host-only helpers in #ifndef __SYCL_DEVICE_ONLY__ so that they are not parsed during device-only compilation:

  1. builtin_default_host_impl
  2. builtin_delegate_to_scalar

The intent is to reduce parsing and template-instantiation cost for translation units that pull in sycl/builtins.hpp, while preserving existing behavior.

Motivation

Compilation-time tracing for a device-side SYCL workload that includes sycl/builtins.hpp showed that the cost is frontend-dominated. This header is on a hot include path, so avoiding an unnecessary transitive dependency and skipping host-only helper code in device mode reduces frontend work in the common device compilation path.

Rough numbers:

HEAD compiler: 
  iter 1: frontend=1183.220 ms  backend=44.570 ms    execute=1234.452 ms  trace=/tmp/sycl-trace-rins/orig/orig.iter1.json
  iter 2: frontend=1178.349 ms  backend=44.313 ms    execute=1229.331 ms  trace=/tmp/sycl-trace-rins/orig/orig.iter2.json
  iter 3: frontend=1186.436 ms  backend=45.257 ms    execute=1238.330 ms  trace=/tmp/sycl-trace-rins/orig/orig.iter3.json
  iter 4: frontend=1180.083 ms  backend=44.657 ms    execute=1231.393 ms  trace=/tmp/sycl-trace-rins/orig/orig.iter4.json
  iter 5: frontend=1176.515 ms  backend=44.306 ms    execute=1227.457 ms  trace=/tmp/sycl-trace-rins/orig/orig.iter5.json
  avg:     frontend=1180.921 ms  backend=44.621 ms    execute=1232.193 ms

This Branch:
  iter 1: frontend=1087.281 ms  backend=45.098 ms    execute=1138.897 ms  trace=/tmp/sycl-trace-rins/new/new.iter1.json
  iter 2: frontend=1088.384 ms  backend=44.814 ms    execute=1139.661 ms  trace=/tmp/sycl-trace-rins/new/new.iter2.json
  iter 3: frontend=1087.635 ms  backend=44.241 ms    execute=1138.346 ms  trace=/tmp/sycl-trace-rins/new/new.iter3.json
  iter 4: frontend=1086.537 ms  backend=44.334 ms    execute=1137.336 ms  trace=/tmp/sycl-trace-rins/new/new.iter4.json
  iter 5: frontend=1091.378 ms  backend=44.172 ms    execute=1142.013 ms  trace=/tmp/sycl-trace-rins/new/new.iter5.json
  avg:     frontend=1088.243 ms  backend=44.532 ms    execute=1139.251 ms

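These numbers are from compiling the benchmark (gemm_sycl.cpp) provided in the JIRA ticket (CMPLRLLVM-73941). On average, total compile time drops from 1232.193 ms to 1139.251 ms, roughly a 7.5% reduction, and nearly all of the saving is in the frontend.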

Replace the vector_convert.hpp dependency in sycl/detail/builtins/builtins.hpp with narrower includes and exclude host-only builtin helper templates from device compilation.

This reduces frontend parsing/instantiation work for users of sycl/builtins.hpp while preserving the public header surface.