
Commit d318361

committed · 2 parents aca4795 + 06d4403


2 files changed: +23, -25 lines


docs/src/lecture_08/lab.md

Lines changed: 1 addition & 3 deletions
@@ -72,9 +72,7 @@ a = x*y # da/dx = y; da/dy = x
 b = sin(x) # db/dx = cos(x)
 z = a + b  # dz/da = 1; dz/db = 1
 ```
-```@raw html
-<p><center><img src="graph.png" alt="graph"></center></p>
-```
+![graph](graph.png)
 
 In the graph you can see that the variable `x` can directly affect `b` and `a`.
 Hence, `x` has two children `a` and `b`. During the forward pass we build the

docs/src/lecture_08/lecture.md

Lines changed: 22 additions & 22 deletions
@@ -43,8 +43,8 @@ The computation of gradient ``\frac{\partial f}{\partial x}`` *theoretically* bo
 
 The complexity of the computation (at least one part of it) is therefore determined by the matrix multiplication, which is generally expensive, as theoretically it has complexity at least ``O(n^{2.3728596}),`` but in practice a little bit more as the lower bound hides the devil in the ``O`` notation. The order in which the Jacobians are multiplied therefore has a profound effect on the complexity of the AD engine. While determining the optimal order of multiplication of a sequence of matrices is costly, in practice we recognize two important cases.
 
-1. Jacobians are multiplied from right to left as ``J_1 \times (J_2 \times ( \ldots \times (J_{n-1}) \times J_n))))`` which has the advantage when the input dimension of ``f: \mathbb{R}^n \rightarrow \mathbb{R}^m`` is smaller than the output dimension, ``n < m``. - referred to as the **FORWARD MODE**
-2. Jacobians are multiplied from left to right as ``((((J_1 \times J_2) \times J_3) \times \ldots ) \times J_n`` which has the advantage when the input dimension of ``f: \mathbb{R}^n \rightarrow \mathbb{R}^m`` is larger than the output dimension, ``n > m``. - referred to as the **BACKWARD MODE**
+1. Jacobians are multiplied from right to left as ``J_1 \times (J_2 \times ( \ldots \times (J_{n-1} \times J_n) \ldots))`` which has the advantage when the input dimension of ``f: \mathbb{R}^n \rightarrow \mathbb{R}^m`` is smaller than the output dimension, ``n < m``. - referred to as the **FORWARD MODE**
+2. Jacobians are multiplied from left to right as ``( \ldots ((J_1 \times J_2) \times J_3) \times \ldots ) \times J_n`` which has the advantage when the input dimension of ``f: \mathbb{R}^n \rightarrow \mathbb{R}^m`` is larger than the output dimension, ``n > m``. - referred to as the **BACKWARD MODE**
 
 The ubiquitous case in machine learning is the minimization of a scalar (loss) function of a large number of parameters. Also notice that for `f` of certain structures, it pays off to do a mixed-mode AD, where some parts are done using forward diff and some parts using reverse diff.
 
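Both orderings produce the same Jacobian, only at very different cost. Here is a toy sketch (the dimensions are picked arbitrarily for illustration) for a map with a small input and a large output, built from three random Jacobians:

```julia
using LinearAlgebra

n, k, m = 2, 500, 1_000      # input dim n, hidden dim k, output dim m (n < m)
J1 = rand(m, k)              # Jacobian of the outermost map
J2 = rand(k, k)
J3 = rand(k, n)              # Jacobian of the innermost map

forward  = J1 * (J2 * J3)    # right to left: intermediates are k×n and m×n (cheap, n is small)
backward = (J1 * J2) * J3    # left to right: intermediate is m×k (expensive here)
forward ≈ backward           # true — same Jacobian, very different amount of work
```

With ``n \ll m`` the right-to-left association never materializes anything wider than ``n`` columns, which is exactly the forward-mode advantage described above.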
@@ -141,7 +141,7 @@ and compute its value at ``v + \dot v \epsilon`` (note that we know how to do ad
 ```math
 \begin{split}
 p(v) &=
-\sum_{i=0}^n p_i(v + v \epsilon )^i =
+\sum_{i=0}^n p_i(v + \dot{v} \epsilon )^i =
 \sum_{i=0}^n \left[p_i \sum_{j=0}^{i}\binom{i}{j}v^{i-j}(\dot v \epsilon)^{j}\right] =
 p_0 + \sum_{i=1}^n \left[p_i \sum_{j=0}^{1}\binom{i}{j}v^{i-j}(\dot v \epsilon)^{j}\right] = \\
 &= p_0 + \sum_{i=1}^n p_i(v^i + i v^{i-1} \dot v \epsilon )
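For a concrete instance of this expansion, evaluating the quadratic ``p(v) = v^2 + 3v`` at the dual number ``v + \dot v \epsilon`` gives

```math
p(v + \dot v \epsilon) = (v + \dot v \epsilon)^2 + 3(v + \dot v \epsilon)
= v^2 + 3v + (2v + 3)\dot v \epsilon,
```

since the ``\dot v^2 \epsilon^2`` term vanishes (``\epsilon^2 = 0``). The coefficient of ``\epsilon`` is exactly ``p'(v)\dot v = (2v + 3)\dot v``.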
@@ -155,7 +155,7 @@ Let's now consider a general function ``f:\mathbb{R} \rightarrow \mathbb{R}``. I
 f(v+\dot v \epsilon) = \sum_{i=0}^\infty \frac{f^{(i)}(v)\dot v^i\epsilon^i}{i!}
 = f(v) + f'(v)\dot v\epsilon,
 ```
-where all higher order terms can be dropped because ``\epsilon^i=0`` for ``i>1``. This shows that we can calculate the gradient of ``f`` at point `v` by calculating its value at `f(v + \epsilon)` and taking the multiplier of `\epsilon`.
+where all higher order terms can be dropped because ``\epsilon^i=0`` for ``i>1``. This shows that we can calculate the gradient of ``f`` at point ``v`` by calculating its value at ``f(v + \epsilon)`` and taking the multiplier of ``\epsilon``.
 
 #### Implementing Dual number with Julia
 To demonstrate the simplicity of Dual numbers, consider the following definition of Dual numbers, where we define a new number type and overload the functions `+`, `-`, `*`, and `/`. In Julia, this reads:
@@ -178,7 +178,7 @@ Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T<:Number,S<:Number} = Dual
 Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T<:Number,S<:Number} = Dual{promote_type(T,S)}
 
 # and define the api for forward differentiation
-forward_diff(f::Function, x::Number) = _dual(f(Dual(x,1.0)))
+forward_diff(f::Function, x::Real) = _dual(f(Dual(x,1.0)))
 _dual(x::Dual) = x.d
 _dual(x::Vector) = _dual.(x)
 ```
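To see how this api is meant to be used, here is a minimal, self-contained sketch of the same idea. The field names `v`/`d` and the handful of overloaded methods are illustrative assumptions; the lecture's full `Dual` definition is richer.

```julia
# minimal dual-number sketch: carry the value and the ε-coefficient together
struct Dual{T<:Number}
    v::T   # primal value
    d::T   # derivative part, i.e. the coefficient of ε
end

# overload just enough arithmetic for the example below
Base.:+(a::Dual, b::Dual) = Dual(a.v + b.v, a.d + b.d)
Base.:*(a::Dual, b::Dual) = Dual(a.v * b.v, a.v * b.d + a.d * b.v)
Base.sin(a::Dual) = Dual(sin(a.v), cos(a.v) * a.d)

# seed ε with 1 and read off the ε-coefficient, mirroring forward_diff above
forward_diff(f::Function, x::Real) = f(Dual(float(x), 1.0)).d

forward_diff(x -> x * sin(x), 2.0)   # sin(2) + 2cos(2) ≈ 0.077
```

The registered `ForwardDiff.jl` package exposes the same operation as `ForwardDiff.derivative(x -> x * sin(x), 2.0)`.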
@@ -216,9 +216,9 @@ plot!(0.1:0.01:2, forward_dsqrt, label="Dual Forward Mode f'", lw=3, ls=:dash)
 2. To make the forward diff work in Julia, we only need to **_overload_** a few **_operators_** for forward mode AD to
    work on **_any function_**
 3. For vector valued function we can use [**_Hyperduals_**](http://adl.stanford.edu/hyperdual/)
-5. Forward diff can differentiation through the `setindex!` (more on this later on)
-6. ForwardDiff is implemented in `ForwardDiff.jl`, which might appear to be neglected, but the truth is that it is very stable and general implementation.
-7. ForwardDiff does not have to be implemented through Dual numbers. It can be implemented similarly to ReverseDiff through multiplication of Jacobians, which is what is the community work on now (in `Diffractor`, `Zygote` with rules defined in `ChainRules`).
+5. Forward diff can differentiation through the `setindex!` (called each time an element is assigned to a place in array, e.g. `x = [1,2,3]; x[2] = 1`)
+6. ForwardDiff is implemented in [`ForwardDiff.jl`](https://github.com/JuliaDiff/ForwardDiff.jl), which might appear to be neglected, but the truth is that it is very stable and general implementation.
+7. ForwardDiff does not have to be implemented through Dual numbers. It can be implemented similarly to ReverseDiff through multiplication of Jacobians, which is what is the community work on now (in [`Diffractor`](https://github.com/JuliaDiff/Diffractor.jl), [`Zygote`](https://github.com/FluxML/Zygote.jl) with rules defined in [`ChainRules`](https://github.com/JuliaDiff/ChainRules.jl)).
 ---
 
 ## Reverse mode
@@ -247,7 +247,7 @@ The need to store intermediate outs has a huge impact on memory requirements, wh
 - When differentiating **Invertible functions**, calculate intermediate outputs from the output. This can lead to a huge performance gain, as all data needed for computations are in caches.
 - **Checkpointing** does not store intermediate outputs after a larger sequence of operations. When they are needed during the backward pass, they are recalculated on demand.
 
-Most reverse mode AD engines does not support mutating values of arrays (`setindex!` in julia). This is related to the memory consumption, where after every `setindex!` you need in theory save the full matrix. `Enzyme` differentiating directly LLVM code supports this, since in LLVM every variable is assigned just once. ForwardDiff methods does not suffer this problem, as the gradient is computed at the time of the values.
+Most reverse mode AD engines does not support mutating values of arrays (`setindex!` in julia). This is related to the memory consumption, where after every `setindex!` you need in theory save the full matrix. [`Enzyme`](https://github.com/wsmoses/Enzyme.jl) differentiating directly LLVM code supports this, since in LLVM every variable is assigned just once. ForwardDiff methods does not suffer this problem, as the gradient is computed at the time of the values.
 
 !!! info
     Reverse mode AD was first published in 1976 by Seppo Linnainmaa[^1], a Finnish computer scientist. It was popularized at the end of the 80s when applied to training multi-layer perceptrons, which gave rise to the famous **backpropagation** algorithm[^2], which is a special case of reverse mode AD.
@@ -328,25 +328,25 @@ track(a::Number, string_tape) = TrackedArray(reshape([a], 1, 1), string_tape)
 
 import Base: +, *
 function *(A::TrackedMatrix, B::TrackedMatrix)
-a, b = value.((A, B))
-C = TrackedArray(a * b, "($(A.string_tape) * $(B.string_tape))")
-push!(A.tape, (C, Δ -> Δ * b'))
-push!(B.tape, (C, Δ -> a' * Δ))
-C
+    a, b = value.((A, B))
+    C = TrackedArray(a * b, "($(A.string_tape) * $(B.string_tape))")
+    push!(A.tape, (C, Δ -> Δ * b'))
+    push!(B.tape, (C, Δ -> a' * Δ))
+    C
 end
 
 function +(A::TrackedMatrix, B::TrackedMatrix)
-C = TrackedArray(value(A) + value(B), "($(A.string_tape) + B)")
-push!(A.tape, (C, Δ -> Δ))
-push!(B.tape, (C, Δ -> Δ))
-C
+    C = TrackedArray(value(A) + value(B), "($(A.string_tape) + $(B.string_tape))")
+    push!(A.tape, (C, Δ -> Δ))
+    push!(B.tape, (C, Δ -> Δ))
+    C
 end
 
 function msin(A::TrackedMatrix)
-a = value(A)
-C = TrackedArray(sin.(a), "sin($(A.string_tape))")
-push!(A.tape, (C, Δ -> cos.(a) .* Δ))
-C
+    a = value(A)
+    C = TrackedArray(sin.(a), "sin($(A.string_tape))")
+    push!(A.tape, (C, Δ -> cos.(a) .* Δ))
+    C
 end
 ```
 
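To get a feel for what the tape records, here is a small usage sketch. The `TrackedArray` struct below is a guessed minimal stand-in (the field names `value`, `tape`, and `string_tape` are inferred from the listing above; the lecture's real definition may differ), and it assumes the `*`, `+`, and `msin` methods from the listing are already defined:

```julia
# minimal stand-in for the tracking type used by the listing above (illustrative only)
struct TrackedArray{T,N}
    value::Array{T,N}            # the actual data
    tape::Vector{Any}            # (child, pullback) pairs pushed by each operation
    string_tape::String          # human-readable record of how the value was built
end
TrackedArray(a::Array, s::String) = TrackedArray(a, Any[], s)
const TrackedMatrix{T} = TrackedArray{T,2}
value(A::TrackedArray) = A.value
track(a::AbstractMatrix, s::String) = TrackedArray(Matrix(a), s)

A = track(rand(2, 2), "A")
B = track(rand(2, 2), "B")
C = msin(A * B) + A              # each operation records itself on its arguments' tapes

C.string_tape                    # "(sin((A * B)) + A)"
length(A.tape)                   # 2 — A fed both the product and the final sum
```

The `string_tape` field only makes the recorded computation readable; the `(child, pullback)` pairs pushed onto each tape are what the backward pass later walks through.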