[CUDNN] Support BFloat16 #2987
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/lib/cudnn/src/util.jl b/lib/cudnn/src/util.jl
index 8923ff9b5..c7ec0c2bd 100644
--- a/lib/cudnn/src/util.jl
+++ b/lib/cudnn/src/util.jl
@@ -4,13 +4,13 @@ using BFloat16s: BFloat16
cptr(x,a::DenseCuArray{Float64})=Float64[x]
cptr(x,a::DenseCuArray{Float32})=Float32[x]
cptr(x,a::DenseCuArray{Float16})=Float32[x]
-cptr(x,a::DenseCuArray{BFloat16})=Float32[x]
+cptr(x, a::DenseCuArray{BFloat16}) = Float32[x]
# Conversion between Julia and cuDNN datatypes
cudnnDataType(::Type{Float16})=CUDNN_DATA_HALF
cudnnDataType(::Type{Float32})=CUDNN_DATA_FLOAT
cudnnDataType(::Type{Float64})=CUDNN_DATA_DOUBLE
-cudnnDataType(::Type{BFloat16})=CUDNN_DATA_BFLOAT16
+cudnnDataType(::Type{BFloat16}) = CUDNN_DATA_BFLOAT16
cudnnDataType(::Type{Int8}) = CUDNN_DATA_INT8
cudnnDataType(::Type{UInt8}) = CUDNN_DATA_UINT8
cudnnDataType(::Type{Int32}) = CUDNN_DATA_INT32
@@ -21,7 +21,7 @@ cudnnDataType(::Type{Int32}) = CUDNN_DATA_INT32
juliaDataType(a)=(a==CUDNN_DATA_HALF ? Float16 :
a==CUDNN_DATA_FLOAT ? Float32 :
a==CUDNN_DATA_DOUBLE ? Float64 :
- a==CUDNN_DATA_BFLOAT16 ? BFloat16 :
+ a == CUDNN_DATA_BFLOAT16 ? BFloat16 :
a==CUDNN_DATA_INT8 ? Int8 :
a==CUDNN_DATA_UINT8 ? UInt8 :
a==CUDNN_DATA_INT32 ? Int32 : error())
diff --git a/lib/cudnn/test/activation.jl b/lib/cudnn/test/activation.jl
index 7b7f2f01a..4164e0231 100644
--- a/lib/cudnn/test/activation.jl
+++ b/lib/cudnn/test/activation.jl
@@ -62,8 +62,8 @@ activationtest(alpha=2)
activationtest(beta=2)
# BFloat16 tests
-(ax,ay) = randn.(BFloat16, (10,10))
-(cx,cy) = CuArray.((ax,ay))
-activationtest(mode=CUDNN_ACTIVATION_SIGMOID)
-activationtest(mode=CUDNN_ACTIVATION_RELU)
-activationtest(mode=CUDNN_ACTIVATION_TANH)
+(ax, ay) = randn.(BFloat16, (10, 10))
+(cx, cy) = CuArray.((ax, ay))
+activationtest(mode = CUDNN_ACTIVATION_SIGMOID)
+activationtest(mode = CUDNN_ACTIVATION_RELU)
+activationtest(mode = CUDNN_ACTIVATION_TANH)
diff --git a/lib/cudnn/test/softmax.jl b/lib/cudnn/test/softmax.jl
index 68967bc1d..ab446813c 100644
--- a/lib/cudnn/test/softmax.jl
+++ b/lib/cudnn/test/softmax.jl
@@ -46,7 +46,7 @@ softmaxtest(algo=CUDNN_SOFTMAX_ACCURATE)
softmaxtest(algo=CUDNN_SOFTMAX_LOG)
# BFloat16 tests
-ax,ay = randn(BFloat16,10,10),randn(BFloat16,10,10)
-cx,cy = CuArray.((ax,ay))
+ax, ay = randn(BFloat16, 10, 10), randn(BFloat16, 10, 10)
+cx, cy = CuArray.((ax, ay))
softmaxtest()
-softmaxtest(algo=CUDNN_SOFTMAX_LOG)
+softmaxtest(algo = CUDNN_SOFTMAX_LOG)
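For context, the util.jl mapping touched by this suggestion is what translates BFloat16 to cuDNN's data-type enum and back. A minimal round-trip check (module-qualified, since these helpers are internal and may not be exported) would look roughly like:

```julia
using BFloat16s: BFloat16
import cuDNN

# forward mapping added in lib/cudnn/src/util.jl by this PR
@assert cuDNN.cudnnDataType(BFloat16) == cuDNN.CUDNN_DATA_BFLOAT16

# reverse mapping, extended in the same file
@assert cuDNN.juliaDataType(cuDNN.CUDNN_DATA_BFLOAT16) === BFloat16
```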
Hm, duplicate of #1092? That one doesn't define the …
1.12 failure unrelated, retried CI |
Second CI fail also seems unrelated, rerunning. If that succeeds you should rebase on top of …
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff            @@
##           master    #2987       +/-  ##
==========================================
+ Coverage   76.53%   89.22%   +12.68%
==========================================
  Files         148      148
  Lines       12860    12950      +90
==========================================
+ Hits         9842    11554    +1712
+ Misses       3018     1396    -1622

☔ View full report in Codecov by Sentry.
Force-pushed from 7f8d47b to 6d2bea8.
Thanks, Katharine. I’ve updated the BFloat16s compat entry to align with CUDA.jl’s …
CUDA.jl Benchmarks
| Benchmark suite | Current: 6d2bea8 | Previous: 1af91be | Ratio |
|---|---|---|---|
| latency/precompile | 55259754867 ns | 55207707384.5 ns | 1.00 |
| latency/ttfp | 7903571593 ns | 7803466984 ns | 1.01 |
| latency/import | 4159302659.5 ns | 4119333235.5 ns | 1.01 |
| integration/volumerhs | 9624485 ns | 9616867 ns | 1.00 |
| integration/byval/slices=1 | 147379.5 ns | 147131 ns | 1.00 |
| integration/byval/slices=3 | 425954 ns | 426158 ns | 1.00 |
| integration/byval/reference | 145246 ns | 145358 ns | 1.00 |
| integration/byval/slices=2 | 286767 ns | 286555 ns | 1.00 |
| integration/cudadevrt | 103742 ns | 103753 ns | 1.00 |
| kernel/indexing | 14466 ns | 14494 ns | 1.00 |
| kernel/indexing_checked | 14918 ns | 15153 ns | 0.98 |
| kernel/occupancy | 673.8471337579617 ns | 670.8662420382166 ns | 1.00 |
| kernel/launch | 2243.6666666666665 ns | 2220.4444444444443 ns | 1.01 |
| kernel/rand | 15206 ns | 15661 ns | 0.97 |
| array/reverse/1d | 20126 ns | 20102.5 ns | 1.00 |
| array/reverse/2dL_inplace | 67043 ns | 67011 ns | 1.00 |
| array/reverse/1dL | 70416 ns | 70372 ns | 1.00 |
| array/reverse/2d | 22004 ns | 22007 ns | 1.00 |
| array/reverse/1d_inplace | 9714 ns | 9891 ns | 0.98 |
| array/reverse/2d_inplace | 13588 ns | 13546 ns | 1.00 |
| array/reverse/2dL | 74045 ns | 73927.5 ns | 1.00 |
| array/reverse/1dL_inplace | 66992 ns | 67056 ns | 1.00 |
| array/copy | 21226 ns | 20954 ns | 1.01 |
| array/iteration/findall/int | 157722.5 ns | 158738.5 ns | 0.99 |
| array/iteration/findall/bool | 139755 ns | 140481.5 ns | 0.99 |
| array/iteration/findfirst/int | 160903 ns | 161535 ns | 1.00 |
| array/iteration/findfirst/bool | 161653.5 ns | 162298 ns | 1.00 |
| array/iteration/scalar | 72877 ns | 73003.5 ns | 1.00 |
| array/iteration/logical | 215656 ns | 218149.5 ns | 0.99 |
| array/iteration/findmin/1d | 52799 ns | 53245 ns | 0.99 |
| array/iteration/findmin/2d | 96706 ns | 96825.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43181 ns | 43671 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 44959.5 ns | 44843.5 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 61617 ns | 61899 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 89151 ns | 89258 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 88131 ns | 88446 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 37040 ns | 37742 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1 | 52063.5 ns | 42436 ns | 1.23 |
| array/reductions/reduce/Float32/dims=2 | 60261 ns | 60098 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 52559 ns | 52602 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 72421 ns | 72179 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43557 ns | 43634 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 54292 ns | 46988 ns | 1.16 |
| array/reductions/mapreduce/Int64/dims=2 | 61725 ns | 61675 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 89262 ns | 89086 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 88213 ns | 88150 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 36825 ns | 36978 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 42127 ns | 48419.5 ns | 0.87 |
| array/reductions/mapreduce/Float32/dims=2 | 60403 ns | 60111 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52850 ns | 52878 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 72360 ns | 72374.5 ns | 1.00 |
| array/broadcast | 20240 ns | 20123 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 13371 ns | 13003 ns | 1.03 |
| array/copyto!/cpu_to_gpu | 215905 ns | 217546 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 282914 ns | 285690 ns | 0.99 |
| array/accumulate/Int64/1d | 124585.5 ns | 124863 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83607 ns | 83917 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157963.5 ns | 158224 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1710806 ns | 1710808 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966785 ns | 966620 ns | 1.00 |
| array/accumulate/Float32/1d | 109529 ns | 109551.5 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80750 ns | 80701.5 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147664 ns | 148055.5 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1619983 ns | 1619581.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698418 ns | 698770 ns | 1.00 |
| array/construct | 1271.15 ns | 1306.1 ns | 0.97 |
| array/random/randn/Float32 | 44845.5 ns | 45766 ns | 0.98 |
| array/random/randn!/Float32 | 25197 ns | 25261 ns | 1.00 |
| array/random/rand!/Int64 | 27627 ns | 27478 ns | 1.01 |
| array/random/rand!/Float32 | 8893.666666666666 ns | 8968 ns | 0.99 |
| array/random/rand/Int64 | 30486 ns | 30173.5 ns | 1.01 |
| array/random/rand/Float32 | 13553 ns | 13273 ns | 1.02 |
| array/permutedims/4d | 55253 ns | 56320.5 ns | 0.98 |
| array/permutedims/2d | 54398 ns | 54500 ns | 1.00 |
| array/permutedims/3d | 55036 ns | 55121.5 ns | 1.00 |
| array/sorting/1d | 2758745 ns | 2758806 ns | 1.00 |
| array/sorting/by | 3345779 ns | 3345943 ns | 1.00 |
| array/sorting/2d | 1082150.5 ns | 1082452 ns | 1.00 |
| cuda/synchronization/stream/auto | 1048 ns | 1022.3076923076923 ns | 1.03 |
| cuda/synchronization/stream/nonblocking | 7331.6 ns | 7461.6 ns | 0.98 |
| cuda/synchronization/stream/blocking | 813.8105263157895 ns | 805.6082474226804 ns | 1.01 |
| cuda/synchronization/context/auto | 1219 ns | 1182.4 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 8014.6 ns | 7352 ns | 1.09 |
| cuda/synchronization/context/blocking | 903.0208333333334 ns | 901.8522727272727 ns | 1.00 |
This comment was automatically generated by a workflow using github-action-benchmark.
The cuDNN run on CUDA 13 appears to fail due to running on SM75.
rebase on master? |
Force-pushed from e7e97ca to 6359bf0.
Done!
This PR defines methods for making cuDNN work with BFloat16s.BFloat16.

In the following example, I show how the new methods fix the BFloat16 backward pass of Flux.logitcrossentropy:

Before

Note: Core.BFloat16 === BFloat16s.BFloat16, but I didn't explicitly import it in this REPL session.
After defining cudnnDataType(::Type{BFloat16})
After defining scalingParameter(::Type{BFloat16}, val)
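For reference, the two added methods amount to roughly the following sketch. The cudnnDataType method comes from the util.jl diff in this PR; the scalingParameter method shown here is an assumption that BFloat16 follows the existing Float16 convention, where cuDNN's alpha/beta scaling factors are passed as Float32:

```julia
# inside the cuDNN module (lib/cudnn/src), where the CUDNN_* constants live
using BFloat16s: BFloat16

# map BFloat16 to cuDNN's data-type enum (from lib/cudnn/src/util.jl in this PR)
cudnnDataType(::Type{BFloat16}) = CUDNN_DATA_BFLOAT16

# sketch only: assumes BFloat16 tensors, like Float16, use Float32 (Cfloat)
# scaling factors when calling into cuDNN
scalingParameter(::Type{BFloat16}, val) = Ref{Cfloat}(Cfloat(val))
```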
I also define a cptr method for consistency, but it appears the function isn't used anywhere.

Tests are added for softmax, activations, and pooling. I initially also tested convolutions, normalization, RNNs, and MHA, but they don't appear to support BFloat16.
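As a rough usage sketch of what those tests exercise (assuming the high-level cudnnSoftmaxForward and cudnnActivationForward wrappers accept the same keyword options the tests pass through; names imported explicitly in case they are not exported):

```julia
using CUDA, BFloat16s
using cuDNN: cudnnSoftmaxForward, cudnnActivationForward,
             CUDNN_SOFTMAX_LOG, CUDNN_ACTIVATION_RELU

x = CuArray(randn(BFloat16, 10, 10))          # BFloat16 input on the GPU

y  = cudnnSoftmaxForward(x)                                    # cuDNN softmax in BFloat16
ly = cudnnSoftmaxForward(x; algo = CUDNN_SOFTMAX_LOG)          # log-softmax variant
r  = cudnnActivationForward(x; mode = CUDNN_ACTIVATION_RELU)   # cuDNN ReLU in BFloat16
```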
Adding BFloat16s.jl as a dependency does not affect compilation since it's already a dependency of CUDA.jl.
Along with my proposed fix in FluxML/Optimisers.jl#215, this has allowed me to train LLMs in BFloat16 with Flux.jl in Julia v1.12. I am still tinkering with Optimisers.jl, but these together would be a significant unlock for my lab.
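To illustrate the end-to-end effect, here is a sketch of the kind of call this unblocks, assuming Flux.logitcrossentropy routes its softmax through the cuDNN path for CuArrays as described above:

```julia
using CUDA, cuDNN, Flux, BFloat16s

# random BFloat16 logits and one-hot targets on the GPU
x = CuArray(randn(BFloat16, 10, 4))
labels = rand(1:10, 4)
y = zeros(BFloat16, 10, 4)
for (j, l) in enumerate(labels)
    y[l, j] = one(BFloat16)
end
cy = CuArray(y)

loss(x) = Flux.logitcrossentropy(x, cy)
loss(x)                         # forward pass in BFloat16
grad, = Flux.gradient(loss, x)  # backward pass, which previously errored for BFloat16
```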