
lecture 8: automatic differentiation

intro

guest lecturer: David P. Sanders, Universidad Nacional Autónoma de México (UNAM) & MIT

goal: calculate derivatives of functions:

  • mathematical functions
  • programming language functions

so, derivatives of functions… or algorithms

reminder: mathematical basics of derivatives

calculate derivatives w/ computer automatically

we have a catalog of differentiation rules; we want the computer to apply them instead of us

why do we want derivatives?

  • solve ODEs
  • gradient descent
  • solve nonlinear equations using Newton’s method
  • calculate sensitivities

newton’s method

have a nonlinear function $f : \mathbb{R} → \mathbb{R}$

want to solve f, i.e. find root $x^*$ s.t. $f(x^*) = 0$

using an iterative method (‘cause there’s no symbolic method) $x_{n+1} = g(x_n)$ with initial condition $x_0$

s.t. $\lim_{n → ∞} x_n = x^*$

newton’s method: pick point, draw tangent line to curve, walk along it to $y = 0$; repeat

algebraically: solve $f(x_1) = 0$ where $x_1 = x_0 + δ$

$f$ is nonlinear… which is hard

if $δ$ is small, then we can Taylor-expand: $f(x_1) = f(x_0 + δ) = f(x_0) + δ f'(x_0) + \frac{1}{2} δ^2 f''(x_0) + …$

quadratic is nonlinear… which is hard

approximate by truncating to the linear part: $f(x_1) = 0 ≈ f(x_0) + δ f'(x_0)$, therefore $δ = -f(x_0) / f'(x_0)$

…this isn’t very well-behaved unless you use some tricks
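
a minimal sketch of the iteration in julia (my own, not from the lecture; the derivative is supplied by hand here, which is exactly the step we want to automate):

# newton iteration: repeatedly apply x ← x - f(x)/f'(x)
function newton(f, fprime, x0; iterations = 20)
    x = x0
    for _ in 1:iterations
        x -= f(x) / fprime(x)
    end
    return x
end

newton(x -> x^2 - 2, x -> 2x, 1.0)   # ≈ 1.4142135…, i.e. √2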

now, on the computer

$f : \mathbb{R} → \mathbb{R}$

derivative definition:

$$f'(a) := \lim_{h → 0} \frac{f(a+h) - f(a)}{h}$$

moving two points closer together along a line, looking at slope between them

(jhg: what are the conditions that have to be in place for this to work?)

rewrite to remove the limit:

$$\lim_{h → 0} \left[\frac{f(a+h) - f(a)}{h} - f'(a)\right] = 0$$

use little-o notation: $f(a+h) - f(a) - h f'(a)$ is $o(h)$

(a quantity $φ(h)$ is $o(h)$ if $\frac{φ(h)}{h} → 0$ as $h → 0$)

so:

$$f(a + h) = f(a) + h f’(a) + o(h)$$

so: if $f(a+h) = A + Bh + o(h)$, then:

  • $A = f(a)$
  • $B = f’(a)$

(jhg: note: the $Bh$ term can’t be absorbed into the $o(h)$ part, because $Bh$ is $O(h)$, not $o(h)$ ;))
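
(a quick worked example, mine rather than the lecture’s: take $f(x) = x^2$; then $f(a+h) = (a+h)^2 = a^2 + 2ah + h^2$, so $A = a^2 = f(a)$, $B = 2a = f'(a)$, and the leftover $h^2$ term is the $o(h)$ part)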

example: derivative of $f ⋅ g$

$f,g : \mathbb{R} → \mathbb{R}$; assume $f$, $g$ are “smooth enough” (sanders: in an analysis class, we could say they’re $C^1$, but we’re not, so…)

$(f ⋅ g)(a + h) = f(a+h) ⋅ g(a+h) = [f(a) + h f'(a) + o(h)][g(a) + h g'(a) + o(h)]$ … multiply out, sweep everything of order $h^2$ or smaller into $o(h)$:

$= f(a)g(a) + h[f(a)g'(a) + g(a)f'(a)] + o(h)$

so, derivative of $f ⋅ g$ is: $f(a)g’(a) + g(a)f’(a)$

this is a useful trick; we’re using the taylor expansions of $f$ and $g$ to build the taylor expansion of $f ⋅ g$, then reading off the derivative of $f ⋅ g$ from it

some people say “derivative is best linear approximation of a function” … actually:

  • tangent line, written as a function, is best affine approximation of $f$ at a point

note: we can compute higher-order derivatives in a similar way, not going to show this

(jhg: this is just a formal justification for the rules we learned in high school)

taylor expansion is also called “jet”; $f(a) + hf’(a) + o(h)$ is “jet of order 1”

exercise: $(f + g)’(a) = f’(a) + g’(a)$

back towards AD

given the above, what information do we need to calculate derivatives of combined functions on a computer?

we only need: $f(a)$ and $f’(a)$ for each $f$, at point $a$

(sanders: why is this enough? well, it’s a first-order taylor expansion, so we only have access to first-order information)

in the computer, we’ll represent this data with a pair of numbers $(f(a), f'(a))$ for each $f$

what data structures do we use here?

  • tuple
  • vector / list
  • (jhg: struct of arrays? sanders: nothing so complicated…)
  • …introduce a new type! because new behaviour. (sanders: behaviour has a “u” in it.)

in julia:

struct SimpleDual   # "dual number": carries a function value and its derivative
    val :: Float64  # f(a)
    der :: Float64  # f'(a)
end

f = SimpleDual(3, 4)
g = SimpleDual(5, 6)

note: each dual can represent a huge (sanders: uncountable??) number of possible functions!

now, what happens if we add these things?

f + g

julia hasn’t been taught to add these things…

Base.:+(f :: SimpleDual, g :: SimpleDual) = SimpleDual(f.val + g.val, f.der + g.der)   # sum rule
f + g

so now we’ve encoded a rule from calculus in julia!

let’s make it generic:

struct Dual{T <: Number}
    val :: T   # f(a)
    der :: T   # f'(a)
end

# sum rule
Base.:+(f :: Dual, g :: Dual) = Dual(f.val + g.val, f.der + g.der)
# product rule (needed below for x * x)
Base.:*(f :: Dual, g :: Dual) = Dual(f.val * g.val, f.val * g.der + f.der * g.val)
# scaling by an ordinary number
Base.:*(f :: Dual, alpha :: Number) = Dual(f.val * alpha, f.der * alpha)
Base.:*(alpha :: Number, f :: Dual) = Dual(f.val * alpha, f.der * alpha)

f = Dual(3, 4)
g = Dual(5, 6)
f + g
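
with these definitions, `f + g` should give a dual with val = 8 and der = 10: values and derivatives each add componentwise, which is just the sum rule from the exercise above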

now, derivatives:

h(x) = x * x + 2.0x    # an ordinary julia function
xx = Dual(3.0, 1.0)    # seed: value 3.0, derivative 1.0 (dx/dx = 1)
h(xx)                  # Dual(15.0, 8.0): h(3) = 15, h'(3) = 2·3 + 2 = 8

we’ve now encoded standard differentiation in a computer!

(this is forward-mode differentiation.)
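
a small convenience wrapper one could add at this point (the name `derivative` is my own, not from the lecture): seed the derivative slot with 1, evaluate, and read off the der field:

# forward-mode derivative of f at a: seed der = 1 (dx/dx = 1), evaluate, read off der
derivative(f, a::Number) = f(Dual(a, one(a))).der

derivative(x -> x * x + 2.0x, 3.0)   # should give 8.0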

higher dimensions

ml is in a million dimensions, so let’s scale this.

have $f: \mathbb{R}^n → \mathbb{R}$

want to calculate the gradient of $f$ at $a$: $∇ f(a)$ = the vector of partial derivatives $\frac{∂ f}{∂ x_i}$ evaluated at $a$

$$f(a+h) = f(a) + ∑_i \frac{∂ f}{∂ x_i}\Big|_a h_i + o(\|h\|) = f(a) + ∇ f(a) ⋅ h + o(\|h\|)$$

now want: $$(f ⋅ g)(a + h) = [f(a) + ∇ f(a) ⋅ h][g(a) + ∇ g(a) ⋅ h] + o(\|h\|)$$

$$= f(a)g(a) + [f(a) ∇ g(a) + g(a) ∇ f(a)] ⋅ h + o(\|h\|)$$

note: $a,h ∈ \mathbb{R}^n$; $⋅ h$ is dot product w/ $h$

for $f : \mathbb{R}^n → \mathbb{R}^m$ we now get the jacobian matrix: one gradient (row) for each output component
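
a sketch of how the Dual idea extends (the name MultiDual and the use of plain Vectors are my own choices, not the lecture’s): the derivative slot becomes a gradient vector, and the rules above become vector operations:

# dual number whose derivative part is a gradient vector
struct MultiDual{T <: Number}
    val  :: T
    grad :: Vector{T}
end

# sum rule and product rule from above, applied to the gradient part
Base.:+(f::MultiDual, g::MultiDual) = MultiDual(f.val + g.val, f.grad + g.grad)
Base.:*(f::MultiDual, g::MultiDual) =
    MultiDual(f.val * g.val, f.val * g.grad + g.val * f.grad)

# gradient of f(x, y) = x*y + x at (3, 4): seed each variable with a unit direction
x = MultiDual(3.0, [1.0, 0.0])
y = MultiDual(4.0, [0.0, 1.0])
x * y + x    # val = 15.0, grad = [5.0, 3.0] = (∂f/∂x, ∂f/∂y)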

reverse-mode differentiation

forward mode is inefficient here because $h$ is n-dimensional: getting a full gradient takes one forward pass per input direction

lookup in notes …

forward mode: with 3 input variables, you’re effectively propagating perturbations in 3 different directions

instead, use reverse mode: more efficient

instead of propagating each perturbation forward, record the partial derivatives of each elementary operation during the forward pass, then propagate sensitivities backwards using the chain rule

can also accumulate into gradients, when input variables occur in multiple places (jhg: i still don’t understand this.)

(jhg: visualization: plot several summed lines, show sums of derivatives. 2d, like that old fourier transform visualization…)

can be implemented source-to-source or with a tape
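
a toy tape/graph sketch of reverse mode (my own, much cruder than real systems like Zygote or ReverseDiff): the forward pass records each operation’s inputs and local partials; the backward pass pushes adjoints back through the graph with the chain rule, accumulating with += when a variable is used in more than one place (this is the accumulation mentioned above):

# each node records its value, its parent nodes, the local partial derivatives
# with respect to those parents, and an adjoint accumulator for the backward pass
mutable struct Node
    val      :: Float64
    parents  :: Vector{Node}
    partials :: Vector{Float64}
    adjoint  :: Float64
end

Node(val) = Node(val, Node[], Float64[], 0.0)   # input variable: no parents

# forward pass: compute values and record local partials
Base.:+(a::Node, b::Node) = Node(a.val + b.val, [a, b], [1.0, 1.0], 0.0)
Base.:*(a::Node, b::Node) = Node(a.val * b.val, [a, b], [b.val, a.val], 0.0)

# backward pass: push this node's adjoint to its parents via the chain rule,
# accumulating (+=) when a variable feeds into several operations
# (a real implementation would walk the tape in reverse topological order
#  instead of recursing eagerly like this)
function backprop!(node::Node)
    for (parent, partial) in zip(node.parents, node.partials)
        parent.adjoint += partial * node.adjoint
        backprop!(parent)
    end
end

x = Node(3.0)
y = Node(4.0)
z = x * y + x            # z.val == 15.0
z.adjoint = 1.0          # seed dz/dz = 1
backprop!(z)
(x.adjoint, y.adjoint)   # should be (5.0, 3.0), the same gradient as before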