Ref. FluxML/Flux.jl#2368. I see two possibly complementary ways to go about this. The easiest would be to define an Adapt rule for OneElement so that it is materialized, or substituted with some GPU-friendly equivalent, when run through CUDA.cu. The other would be to define overloads for certain functions, such as mul!, that can take advantage of the sparsity.
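
To illustrate the second option, here is a rough CPU-only sketch of a matrix-vector product that exploits the sparsity by scaling a single column instead of doing a full multiply. `onehot_mul!` is a hypothetical helper name, not an existing FillArrays or LinearAlgebra function, and the sketch assumes `OneElement` stores its value and index tuple in the `val` and `ind` fields. A real fix would presumably hook into `mul!` itself, and I haven't tested this on GPU arrays.

```julia
using FillArrays: OneElement
using LinearAlgebra

# Hypothetical helper (not part of FillArrays or LinearAlgebra):
# computes y = A * x for a one-element vector x by scaling one column of A.
# Assumption: OneElement keeps its value in `val` and its index tuple in `ind`.
function onehot_mul!(y::AbstractVector, A::AbstractMatrix, x::OneElement{<:Any,1})
    size(A, 2) == length(x) || throw(DimensionMismatch("columns of A must match length of x"))
    size(A, 1) == length(y) || throw(DimensionMismatch("rows of A must match length of y"))
    k = x.ind[1]
    if k in axes(A, 2)
        # Only column k contributes; broadcasting over a view avoids scalar indexing.
        y .= x.val .* view(A, :, k)
    else
        # Defensive fallback: if the stored index lies outside the axes, x is all zeros.
        fill!(y, zero(eltype(y)))
    end
    return y
end

A = [1.0 2.0 3.0; 4.0 5.0 6.0]
x = OneElement(2.0, 2, 3)        # length-3 vector with 2.0 at index 2
y = onehot_mul!(zeros(2), A, x)
```

The result should match the dense product `A * collect(x)`. The dense fallback on the right-hand side is also roughly what the first option (materializing in an Adapt rule) would amount to.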