update the x according to the description #281

Open · wants to merge 1 commit into master
2 changes: 1 addition & 1 deletion neural-networks-3.md
@@ -67,7 +67,7 @@ Also keep in mind that the deeper the network, the higher the relative errors will be

**Stick around the active range of floating point**. It's a good idea to read through ["What Every Computer Scientist Should Know About Floating-Point Arithmetic"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points starts to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.
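
A minimal sketch of this sanity check (the helper, threshold, and example values below are illustrative, not from the notes): print the raw numerical and analytic gradients alongside their relative error, and flag comparisons whose magnitudes are suspiciously small.

```python
import numpy as np

def relative_error(a, b, eps=1e-12):
    # |a - b| / max(|a|, |b|), with eps guarding against division by zero
    return np.abs(a - b) / np.maximum(np.maximum(np.abs(a), np.abs(b)), eps)

# Hypothetical gradient-check output: print the raw numbers, not just the
# relative error, so suspiciously tiny magnitudes stand out.
g_numerical, g_analytic = 3.2e-11, 3.1e-11
print('numerical: %g  analytic: %g  relative error: %g'
      % (g_numerical, g_analytic, relative_error(g_numerical, g_analytic)))

# If both gradients are this small, temporarily scale the loss up by a
# constant so the comparison happens in a denser float range (~1.0).
if max(abs(g_numerical), abs(g_analytic)) < 1e-10:
    print('warning: gradients are tiny; consider scaling the loss before checking')
```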

**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\\(max(0,x)\\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \\(x = -1e6\\). Since \\(x < 0\\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \\(f(x+h)\\) might cross over the kink (e.g. if \\(h > 1e-6\\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \\(max(0,x)\\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\\(max(0,x)\\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \\(x = -1e-6\\). Since \\(x < 0\\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \\(f(x+h)\\) might cross over the kink (e.g. if \\(h > 1e-6\\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \\(max(0,x)\\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
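
As a rough numerical illustration of this failure mode (the values mirror the example above; the centered-difference formula is the standard one used for gradient checks):

```python
def relu(x):
    return max(0.0, x)

x, h = -1e-6, 1e-5                  # |x| < h, so x + h lands on the other side of the kink
analytic = 0.0                      # d/dx max(0, x) = 0 for x < 0
numerical = (relu(x + h) - relu(x - h)) / (2 * h)   # centered difference
print(analytic, numerical)          # numerical comes out non-zero (0.45 here), not 0
```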

Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all "winners" in a function of the form \\(max(x,y)\\); that is, whether x or y was higher during the forward pass. If the identity of at least one winner changes when evaluating \\(f(x+h)\\) and then \\(f(x-h)\\), then a kink was crossed and the numerical gradient will not be exact.
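
One possible way to implement that bookkeeping (a sketch assuming an elementwise \\(max(x,y)\\) as in a ReLU layer; the function name and values are illustrative):

```python
import numpy as np

def max_forward(x, y):
    # Return both the elementwise max and the identity of the "winner",
    # so that two evaluations of the forward pass can be compared.
    winner = x >= y
    return np.maximum(x, y), winner

x = np.array([-1e-6, 2.0, -3.0])   # example inputs
y = np.zeros_like(x)               # ReLU corresponds to max(x, 0)
h = 1e-5

_, winners_plus  = max_forward(x + h, y)
_, winners_minus = max_forward(x - h, y)
if np.any(winners_plus != winners_minus):
    print('a kink was crossed; the numerical gradient will not be exact here')
```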
