slang2drjit compiles trusted local Slang CUDA kernels into Dr.Jit custom
operations and exposes them through a Python loadModule() API.
The project is an experimental v2 foundation, not a general-purpose Slang
binding generator. The public argument model is intentionally focused on
DiffTensorView and TensorView kernels, with native Dr.Jit integration for
the dtype and AD paths covered by the test suite.
- Compiles Slang kernels to CUDA source and torch-binding metadata source.
- Parses generated __funcinfo__* metadata into a structured Python model.
- Generates a native nanobind host wrapper and CUDA launch shim.
- Exposes Slang kernels as keyword-only Python methods.
- Allocates output arguments automatically.
- Supports explicit multi-output functions with outputArgs.
- Supports explicit and source-declared output allocation with outputShapes, hostAlloc, and [DrJitEntryPoint].
- Dispatches overloads by keyword set, dtype, and parsed rank metadata.
- Caches builds by source, options, package versions, and hashed include-path contents.
- Supports Dr.Jit forward and backward AD paths for the currently verified dtype and view combinations.
- Python 3.10+
- Dr.Jit with CUDA support
- slangc with CUDA and torch-binding target support. The package includes a bundled Windows slangc.exe; set SLANGC_PATH to override it, or put slangc on PATH on other platforms.
- CUDA toolkit
- CMake 3.26+
- Ninja
- A working C++/CUDA compiler toolchain:
  - Windows: MSVC with CUDA integration
  - Linux: GCC or Clang with CUDA integration
Native build dependencies are environment requirements today; they are not yet fully declared as optional package extras.
Install the Python package from a local checkout:
python -m pip install -e .

Install Dr.Jit, CUDA, CMake, Ninja, and the platform compiler toolchain separately according to their upstream instructions.
Create a Slang kernel:
[AutoPyBindCUDA]
[CUDAKernel]
[Differentiable]
void square(DiffTensorView input, DiffTensorView output)
{
    uint3 i = cudaThreadIdx() + cudaBlockIdx() * cudaBlockDim();
    if (i.x >= input.size(0)) return;
    output[i.x] = input[i.x] * input[i.x];
}

Load and call it from Python:
import drjit as dr
import drjit.cuda.ad as ad
from slang2drjit import loadModule
module = loadModule("square.slang")
x = ad.Float([1.0, 2.0, 3.0])
dr.enable_grad(x)
y = module.square(input=x).launchRaw(
    blockSize=(128, 1, 1),
    gridSize=(1, 1, 1),
)
dr.backward(dr.sum(y))
print(y)
print(dr.grad(x))

Python calls are keyword-only and follow slangtorch launch semantics:
module.kernel(**kwargs) returns a launchable object, and the kernel runs when
you call .launchRaw(blockSize, gridSize). Output arguments are allocated by
slang2drjit; pass only input arguments from Python.
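The two-step call pattern described above can be pictured with a toy model in plain Python. This is only a sketch of the idea, assuming nothing about the real implementation: ToyLaunchable, make_kernel, and square_impl are hypothetical names, and no GPU code runs here.

```python
class ToyLaunchable:
    """Toy stand-in for the object returned by module.kernel(**kwargs)."""

    def __init__(self, fn, kwargs):
        self._fn = fn
        self._kwargs = kwargs

    def launchRaw(self, blockSize, gridSize):
        # The work happens only here, not when the arguments were bound.
        # (The launch dimensions are ignored in this CPU-only toy.)
        return self._fn(**self._kwargs)


def make_kernel(fn):
    def bind(**kwargs):  # keyword-only: positional arguments raise TypeError
        return ToyLaunchable(fn, kwargs)
    return bind


calls = []

def square_impl(input):
    calls.append("ran")
    return [v * v for v in input]


square = make_kernel(square_impl)
pending = square(input=[1.0, 2.0, 3.0])
print(calls)  # [] because nothing has run yet
y = pending.launchRaw(blockSize=(128, 1, 1), gridSize=(1, 1, 1))
print(y)      # [1.0, 4.0, 9.0]
```

Binding arguments and launching are separate steps, so launch dimensions stay an explicit choice at the call site.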
from slang2drjit import loadModule
module = loadModule(
    "kernel.slang",
    defines={"USE_FAST_PATH": 1},
    includePaths=["include"],
    extraSlangFlags=[],
    extraCudaFlags=[],
)

loadModule() compiles the Slang source, builds a native extension in the
cache, imports it, and returns a proxy object whose methods correspond to
public Slang kernels.
slang2drjit mirrors slangtorch's explicit launch API:
launchable = module.square(input=x)
y = launchable.launchRaw(
    blockSize=(128, 1, 1),
    gridSize=((dr.width(x) + 127) // 128, 1, 1),
)

blockSize and gridSize must both be tuples of three integers. As in
slangtorch, launchTotal() and autoLaunch() are present but not implemented;
use launchRaw() for now.
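Because launchRaw() takes explicit dimensions, a small helper can derive a 1D grid that covers every element. The following is a minimal sketch in plain Python; grid_for is a hypothetical name and is not part of slang2drjit.

```python
def grid_for(n, block_size=(128, 1, 1)):
    """Return an (x, 1, 1) grid with enough blocks along x to cover n elements.

    Hypothetical helper for illustration; not part of slang2drjit.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    bx = block_size[0]
    # Ceiling division; keep at least one block so the grid is never empty.
    return (max(1, (n + bx - 1) // bx), 1, 1)


print(grid_for(3))     # (1, 1, 1)
print(grid_for(1000))  # (8, 1, 1)
```

The kernel still needs its own bounds check, like the i.x >= input.size(0) test in the square example, because the last block is usually only partly full.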
By default, the last public Slang argument is treated as the single output:
module = loadModule("square.slang")
y = module.square(input=x).launchRaw(blockSize=(128, 1, 1), gridSize=(1, 1, 1))

For multiple outputs, mark them explicitly:
module = loadModule(
    "split.slang",
    outputArgs={"split": ("left", "right")},
)
left, right = module.split(input=x).launchRaw(blockSize=(128, 1, 1), gridSize=(1, 1, 1))

Multiple outputs return a plain Python tuple.
When an output shape differs from the first input shape, provide outputShapes:
module = loadModule(
    "reshape_output.slang",
    outputShapes={"reshapeOut": {"output": (2, 3)}},
)
y = module.reshapeOut(input=x).launchRaw(blockSize=(128, 1, 1), gridSize=(1, 1, 1))
assert tuple(y.shape) == (2, 3)

For allocation logic that belongs with the Slang source, use
[DrJitEntryPoint]:
[DrJitEntryPoint]
DiffTensorView reshapeOut(DiffTensorView input)
{
    var output = DrJitTensor<float>.empty(input.size(0) / 3, 3);
    __dispatch_kernel(reshapeOut_kernel)(input, output);
    return output;
}
[AutoPyBindCUDA]
[CUDAKernel]
[Differentiable]
void reshapeOut_kernel(DiffTensorView input, DiffTensorView output)
{
    /* write output */
}

Then Python can load the file directly:
module = loadModule("reshape_output.slang")
y = module.reshapeOut(input=x).launchRaw(blockSize=(128, 1, 1), gridSize=(1, 1, 1))

hostAlloc remains available as a lower-level Python-side allocation rule:
module = loadModule(
    "reshape_output.slang",
    hostAlloc={"reshapeOut": {"output": lambda input: (input.shape[0] // 3, 3)}},
)

For data-dependent output lengths, allocate a capacity-shaped output and return a separate count output:
[DrJitEntryPoint]
(DiffTensorView, TensorView<uint>) compact(DiffTensorView input)
{
    var values = DrJitTensor<float>.emptyLike(input);
    var count = DrJitTensor<uint>.empty(1);
    __dispatch_kernel(compact_kernel)(input, values, count);
    return values, count;
}

Overloads are exposed through one Python method and selected at call time:
[AutoPyBindCUDA]
[CUDAKernel]
[Differentiable]
void tag(DiffTensorView x, DiffTensorView out) { /* float path */ }
[AutoPyBindCUDA]
[CUDAKernel]
[Differentiable]
void tag(TensorView<uint> x, TensorView<uint> out) { /* uint path */ }

module = loadModule("dtype_overloads.slang")
float_result = module.tag(x=ad.Float([1.0, 2.0, 3.0])).launchRaw(
    blockSize=(128, 1, 1),
    gridSize=(1, 1, 1),
)
uint_result = module.tag(x=ad.UInt([1, 2, 3])).launchRaw(
    blockSize=(128, 1, 1),
    gridSize=(1, 1, 1),
)

Native extensions are cached by a build key derived from:
- Slang source contents
- selected compile options and flags
- relevant package versions
- output metadata options
- hashed includePaths contents
This keeps repeated loads fast while invalidating builds when source, configuration, or included files change.
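One way to picture such a build key is a single digest over all of the inputs listed above. The sketch below is an assumption about the general technique, not the project's actual key layout; build_key and its parameters are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def build_key(source_text, options, versions, include_paths=()):
    """Return a hex digest that changes when any build input changes.

    Hypothetical sketch; the real key layout in slang2drjit may differ.
    """
    h = hashlib.sha256()

    def feed(data: bytes):
        # Length-prefix each field so adjacent fields cannot blur together.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)

    feed(source_text.encode("utf-8"))
    # sort_keys makes the digest independent of dict insertion order.
    feed(json.dumps(options, sort_keys=True, default=str).encode("utf-8"))
    feed(json.dumps(versions, sort_keys=True).encode("utf-8"))
    for root in sorted(str(p) for p in include_paths):
        # Hash every file below each include path in a stable order.
        for f in sorted(Path(root).rglob("*")):
            if f.is_file():
                feed(f.relative_to(root).as_posix().encode("utf-8"))
                feed(f.read_bytes())
    return h.hexdigest()


opts = {"defines": {"USE_FAST_PATH": 1}}
vers = {"slang2drjit": "0.0.0"}  # placeholder version string
print(build_key("void f() {}", opts, vers) == build_key("void f() {}", opts, vers))  # True
print(build_key("void f() {}", opts, vers) == build_key("void g() {}", opts, vers))  # False
```

Hashing the contents of included files, rather than only their paths, is what lets an edit to a shared header invalidate a cached build.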
Public Slang arguments are limited to the DiffTensorView and TensorView
families.
Currently verified native paths include:
- Scalar array dtypes: float32, float16, float64, int32, uint32, int8, uint8, int64, uint64
- Tensor dtypes: TensorXf, TensorXf64, TensorXi, TensorXu
- AD paths: float32, float16, float64, TensorXf, TensorXf64
- Multi-output AD for float arrays and TensorXf
- Mixed float/int and mixed float/uint backward paths
- The host wrapper uses a local S2DCustomOp compatibility path. It avoids #define private public and direct writes to Dr.Jit private CustomOp::m_output, but it is not a direct call to the upstream drjit::custom() helper.
- Metadata parsing is mostly regex/string based over generated Slang/C++ output, not a full AST parser.
- Slang scalar parameters, structs, buffers, textures, and samplers are out of scope for now.
- Packaging metadata is not yet ready for a stable PyPI release.
Run the standard checks with a repository-local pytest temp directory:
python -m pytest -q --basetemp=.tmp/pytest-basetemp
python -m compileall -q src tests

Run CUDA integration tests in an environment that has Dr.Jit CUDA, slangc,
CUDA, CMake, Ninja, and a working C++/CUDA compiler toolchain:
python -m pytest -q -m cuda_integration --basetemp=.tmp/pytest-basetemp

If CUDA integration tests are skipped, inspect the skip reason before treating native build support as verified.
slang2drjit is distributed under the MIT License. See LICENSE.