
Commit 34deb3b

A better example of a profiler

1 parent ff66ab9 commit 34deb3b

File tree

docs/src/lecture_11/lecture.md
docs/src/lecture_11/profile_nn.jl

2 files changed: +41 -1 lines changed
docs/src/lecture_11/lecture.md

Lines changed: 1 addition & 1 deletion
@@ -184,7 +184,7 @@ is about `315` μs, which is still 160x faster.
     NVTX.@range "julia set" juliaset_pixel.(cis, cjs, n);
 end
 ```
-for better orientation in the code. Note that if NVTX information does not show up in the trace, we have to add it to the trace by running the profiler with `--trace=cuda,nvtx`.
+for better orientation in the code. Note that if NVTX information does not show up in the trace, we have to add it to the trace by running the profiler with `--trace=cuda,nvtx`. [For a more sophisticated example, click here.](profile_nn.jl)
 Lastly, it is recommended to run a kernel twice in a profile trace, as the first execution of the kernel in a profiler incurs some overhead even though the code has already been compiled.

 In the output of the profiler we see that there is a lot of overhead caused by launching the kernel itself, and that the execution itself is relatively fast.
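
The profiler itself is launched from the shell, e.g. `nsys profile --trace=cuda,nvtx julia profile_nn.jl` (assuming NVIDIA Nsight Systems is installed). A minimal sketch of the warm-up pattern recommended above, assuming the `juliaset_pixel` kernel and the `cis`, `cjs`, `n` inputs defined earlier in the lecture:

```julia
# First call pays the one-off kernel-launch (and possible compilation) overhead.
juliaset_pixel.(cis, cjs, n);
# The second, profiled call then reflects the steady-state cost.
CUDA.@profile CUDA.@sync begin
    NVTX.@range "julia set (warm)" juliaset_pixel.(cis, cjs, n);
end
```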

docs/src/lecture_11/profile_nn.jl

Lines changed: 40 additions & 0 deletions
using CUDA
using NVTX   # needed for NVTX.@range unless your CUDA.jl version re-exports NVTX

# define a dense layer
struct Dense{W<:AbstractArray,B<:AbstractArray,F}
    w::W
    b::B
    f::F
end

function Dense(idim, odim, f = identity)
    Dense(randn(Float32, odim, idim), randn(Float32, odim), f)
end

function (l::Dense)(x)
    l.f.(l.w * x .+ l.b)
end

# define moving of data between CPU and GPU
gpu(x::AbstractArray) = CuArray(x)
cpu(x::CuArray) = Array(x)
gpu(l::Dense) = Dense(gpu(l.w), gpu(l.b), l.f)
gpu(l::ComposedFunction) = gpu(l.outer) ∘ gpu(l.inner)

# a simple but powerful non-linearity
relu(x::T) where {T<:Number} = max(x, zero(T))

# Let's now define a small neural network with one hidden layer
x = randn(Float32, 16, 100)
l₁ = Dense(16, 32, relu)
l₂ = Dense(32, 8)
nn = l₂ ∘ l₁

# and try to profile a computation; @sync ensures all GPU work finishes
# inside the profiled region
CUDA.@profile CUDA.@sync begin
    NVTX.@range "moving nn to gpu" gpu_nn = gpu(nn)
    NVTX.@range "moving x to gpu" gpu_x = gpu(x)
    NVTX.@range "nn(x)" o = gpu_nn(gpu_x)
    NVTX.@range "moving results to cpu" cpu(o)
end
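
In line with the recommendation above, one might run the profiled block a second time, since the first execution incurs one-off launch and compilation overhead. A hypothetical sanity check (not part of the committed file, using the globals defined above) that the GPU forward pass matches the CPU one:

```julia
# Hypothetical check: GPU and CPU forward passes should agree up to Float32 tolerance.
@assert isapprox(cpu(gpu_nn(gpu_x)), nn(x); atol = 1f-4)
```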
