guest lecturer: David P. Sanders, Universidad Nacional Autónoma de México (UNAM) & MIT
goal: calculate derivatives of functions:
- mathematical functions
- programming language functions
so, derivatives of functions… or algorithms
reminder: mathematical basics of derivatives
calculate derivatives w/ computer automatically
have a catalog of differentiation rules; want the computer to apply them instead of us
why do we want derivatives?
- solve ODEs
- gradient descent
- solve nonlinear equations using Newton’s method
- calculate sensitivities
have a nonlinear equation $f(x) = 0$
want to solve it, i.e. find a root $x^*$ of $f$
using an iterative method (’cause there’s no general symbolic method)
$x_{n+1} = g(x_n)$ with initial condition $x_0$
s.t. $\lim_{n \to \infty} x_n = x^*$
newton’s method: pick a point, draw the tangent line to the curve, walk along it to where it crosses the x-axis; repeat
algebraically:
solve $f(x_n + \delta) = 0$ for the step $\delta$
if we taylor-expand: $f(x_n + \delta) = f(x_n) + f'(x_n)\,\delta + \frac{1}{2} f''(x_n)\,\delta^2 + \cdots$
the quadratic term makes this nonlinear in $\delta$… which is hard
approximate by truncating to the linear part: $f(x_n) + f'(x_n)\,\delta \approx 0$, giving $\delta = -f(x_n)/f'(x_n)$, i.e. $x_{n+1} = x_n - f(x_n)/f'(x_n)$
…this isn’t very well-behaved unless you use some tricks
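a minimal sketch of this iteration in Julia (my own illustration; the name `newton`, the derivative argument `fp`, and the fixed step count are assumptions, not from the lecture):

function newton(f, fp, x0; steps = 10)
    x = x0
    for _ in 1:steps
        x -= f(x) / fp(x)   # step to where the tangent line at x crosses zero
    end
    return x
end

newton(x -> x^2 - 2, x -> 2x, 1.0)   # converges to sqrt(2) ≈ 1.4142135623730951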
derivative definition: $f'(a) := \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$
moving two points closer together along the curve, looking at the slope of the secant between them
(jhg: what are the conditions that have to be in place for this to work?)
rewrite to remove the limit;
use little-o notation: $g(h) = o(h)$ means $g(h)/h \to 0$ as $h \to 0$
so: $f(a + h) = f(a) + f'(a)\,h + o(h)$
so: if $f(a + h) = A + B\,h + o(h)$, then
$A = f(a)$, $B = f'(a)$
(jhg: note: $B$ can’t depend on $h$.)
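concretely (a worked example of my own, not from the lecture): for $f(x) = x^2$ at $a = 3$, $f(3 + h) = (3 + h)^2 = 9 + 6h + h^2$, and $h^2 = o(h)$, so $A = f(3) = 9$ and $B = f'(3) = 6$.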
so, the derivative of $f$ at $a$ is the coefficient $B$ of the linear term
this is a useful trick; we’re using a taylor expansion of $f$ around $a$, truncated to first order
some people say “the derivative is the best linear approximation of a function” … actually:
- the tangent line, written as a function $x \mapsto f(a) + f'(a)\,(x - a)$, is the best affine approximation of $f$ at a point (a linear map would have to pass through the origin)
note: we can compute higher-order derivatives in a similar way, not going to show this
(jhg: this is just a formal justification to derive the rules we learned in high school.)
a (truncated) taylor expansion is also called a “jet”
exercise:
given the above, what information do we need to calculate derivatives of combined functions on a computer?
we only need: the value $f(a)$ and the derivative $f'(a)$ of each constituent function at the point, i.e. the pair $(f(a), f'(a))$
(sanders: why is this enough? well, it’s a first-order taylor expansion, so we only have access to first-order information)
in the computer, we’ll represent this data with a pair of numbers
what data structures do we use here?
- tuple
- vector / list
- (jhg: struct of arrays? sanders: nothing so complicated…)
- …introduce a new type! because new behaviour. (sanders: behaviour has a “u” in it.)
in julia:
import Base: +   # so we can add a method to + for our new type
struct SimpleDual # "dual number"
    val :: Float64
    der :: Float64
end
f = SimpleDual(3, 4)
g = SimpleDual(5, 6)
note: each dual can represent a huge (sanders: uncountable??) number of possible functions!
now, what happens if we add these things?
f + g
julia hasn’t been taught to add these things…
+(f :: SimpleDual, g :: SimpleDual) = SimpleDual(f.val + g.val, f.der + g.der)
f + g
so now we’ve encoded a rule from calculus in julia!
let’s make it generic:
struct Dual{T <: Number}
    val :: T
    der :: T
end
Base.:+(f :: Dual, g :: Dual) = Dual(f.val + g.val, f.der + g.der)
Base.:*(f :: Dual, alpha :: Number) = Dual(f.val * alpha, f.der * alpha)
Base.:*(alpha :: Number, f :: Dual) = Dual(f.val * alpha, f.der * alpha)
Base.:*(f :: Dual, g :: Dual) = Dual(f.val * g.val, f.der * g.val + f.val * g.der)   # product rule; needed for x * x below
f = Dual(3, 4)
g = Dual(5, 6)
f + g
now, derivatives:
h(x) = x * x + 2.0x
xx = Dual(3.0, 1.0)
h(xx)
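# → Dual(15.0, 8.0): value h(3) = 15, derivative h'(3) = 2·3 + 2 = 8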
we’ve now encoded standard differentiation in a computer!
(this is forward-mode differentiation.)
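a sketch of how this extends beyond + and * (my example; the lecture doesn’t show it at this point): each elementary function gets its own method on Dual, encoding the chain rule:

# chain rule: d/dx sin(f(x)) = cos(f(x)) * f'(x)
Base.sin(f :: Dual) = Dual(sin(f.val), cos(f.val) * f.der)

sin(Dual(3.0, 1.0))   # value sin(3.0), derivative cos(3.0)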
ml is in a million dimensions, so let’s scale this.
have $f: \mathbb{R}^n \to \mathbb{R}$
want to calculate the gradient of $f$
now want: all $n$ partial derivatives $\partial f / \partial x_i$
note: one forward pass propagates a single directional derivative
for $n$ input dimensions, that means $n$ separate passes
forward mode is inefficient because its cost scales with the number of inputs, and ml has millions of them
…
lookup in notes …
forward-mode: with 3 input variables, you’re effectively propagating perturbations in 3 different directions, one pass each
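what that costs in code, as a sketch (my illustration; `forward_gradient` is a made-up helper built on the `Dual` type above, seeding one coordinate per pass):

# gradient of f : R^n → R via n forward passes
function forward_gradient(f, x)
    n = length(x)
    grad = zeros(n)
    for i in 1:n
        seeded = [Dual(x[j], j == i ? 1.0 : 0.0) for j in 1:n]   # perturb only coordinate i
        grad[i] = f(seeded...).der
    end
    return grad
end

forward_gradient((x, y) -> x * x + 2.0y, [3.0, 5.0])   # [6.0, 2.0]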
instead, use reverse mode: more efficient
instead of propagating each perturbation forward, record the partial derivatives of each elementary operation, then propagate sensitivities backwards using the chain rule
can also accumulate into gradients, when input variables occur in multiple places; see the tape sketch at the end of this section (jhg: i still don’t understand this.)
(jhg: visualization: plot several summed lines, show sums of derivatives. 2d, like that old fourier transform visualization…)
can be implemented source-to-source or with a tape
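to make the tape and the accumulation concrete, here’s a minimal tape-based reverse-mode sketch (my own illustration, not the lecture’s code; `Tape`, `record!`, `mul!`, and `gradients` are made-up names):

# the tape records each variable's value plus, for computed variables,
# the tape indices of its inputs and the local partials ∂output/∂input
struct TapeEntry
    inputs :: Vector{Int}
    partials :: Vector{Float64}
end
struct Tape
    vals :: Vector{Float64}
    entries :: Vector{TapeEntry}
end
Tape() = Tape(Float64[], TapeEntry[])

function record!(t :: Tape, val, inputs = Int[], partials = Float64[])
    push!(t.vals, val)
    push!(t.entries, TapeEntry(inputs, partials))
    return length(t.vals)   # tape index of the new variable
end

# forward pass for *: record the value and the two local partials
mul!(t :: Tape, i, j) = record!(t, t.vals[i] * t.vals[j], [i, j], [t.vals[j], t.vals[i]])

# backward pass: walk the tape in reverse, accumulating into g
function gradients(t :: Tape, out :: Int)
    g = zeros(length(t.vals))
    g[out] = 1.0
    for k in length(t.entries):-1:1
        for (i, p) in zip(t.entries[k].inputs, t.entries[k].partials)
            g[i] += g[k] * p   # += is the accumulation step
        end
    end
    return g
end

t = Tape()
x = record!(t, 3.0)   # leaf variable, no inputs
y = mul!(t, x, x)     # y = x * x: x occurs in two places
gradients(t, y)[x]    # 6.0 = 3.0 + 3.0, i.e. d(x^2)/dx = 2x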