Published as a conference paper at ICLR 2015
ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION
Diederik P. Kingma*
University of Amsterdam, OpenAI
Jimmy Lei Ba∗
University of Toronto
ABSTRACT
We introduce Adam, an algorithm for first-order gradient-based optimization of
stochastic objective functions, based on adaptive estimates of lower-order mo-
ments. The method is straightforward to implement, is computationally efficient,
has low memory requirements, is invariant to diagonal rescaling of the gradients,
and is well suited for problems that are large in terms of data and/or parameters.
The method is also appropriate for non-stationary objectives and problems with
very noisy and/or sparse gradients. The hyper-parameters have intuitive interpre-
tations and typically require little tuning. Some connections to related algorithms,
by which Adam was inspired, are discussed. We also analyze the theoretical con-
vergence properties of the algorithm and provide a regret bound on the conver-
gence rate that is comparable to the best known results under the online convex
optimization framework. Empirical results demonstrate that Adam works well in
practice and compares favorably to other stochastic optimization methods. Finally,
we discuss AdaMax, a variant of Adam based on the infinity norm.
1 INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and
engineering. Many problems in these fields can be cast as the optimization of some scalar parameter-
ized objective function requiring maximization or minimization with respect to its parameters. If the
function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization
method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same
computational complexity as just evaluating the function. Often, objective functions are stochastic.
For example, many objective functions are composed of a sum of subfunctions evaluated at different
subsamples of data; in this case optimization can be made more efficient by taking gradient steps
w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself
as an efficient and effective optimization method that was central in many machine learning success
stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton
& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other
sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For
all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this
paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In
these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be
restricted to first-order methods.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients and has small memory requirements. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients; the name Adam
is derived from adaptive moment estimation. Our method is designed to combine the advantages
of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gra-
dients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary
settings; important connections to these and other stochastic optimization methods are clarified in
section 5. Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to
rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,
it does not require a stationary objective, it works with sparse gradients, and it naturally performs a
form of step size annealing.
∗Equal contribution. Author ordering determined by coin flip over a Google Hangout.
arXiv:1412.6980v9 [cs.LG] 30 Jan 2017
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. gt² indicates the elementwise square gt ⊙ gt. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^−8. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.
Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
  while θt not converged do
    t ← t + 1
    gt ← ∇θ ft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)
    mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate)
    vt ← β2 · vt−1 + (1 − β2) · gt² (Update biased second raw moment estimate)
    m̂t ← mt/(1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂t ← vt/(1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θt ← θt−1 − α · m̂t/(√v̂t + ε) (Update parameters)
  end while
  return θt (Resulting parameters)
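For concreteness, the following is a minimal NumPy sketch of Algorithm 1, not the authors' reference implementation; the quadratic toy objective and the names adam and noisy_grad are illustrative choices.

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Minimal sketch of Algorithm 1. grad_fn(theta, t) returns a stochastic
    gradient of the objective f_t at timestep t."""
    theta = theta0.astype(float)
    m = np.zeros_like(theta)                       # 1st moment vector
    v = np.zeros_like(theta)                       # 2nd moment vector
    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)                      # gradient at timestep t
        m = beta1 * m + (1 - beta1) * g            # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g**2         # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1**t)                 # bias-corrected 1st moment
        v_hat = v / (1 - beta2**t)                 # bias-corrected 2nd raw moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: noisy gradients of f(theta) = 0.5 * ||theta - target||^2.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 3.0])
noisy_grad = lambda theta, t: (theta - target) + 0.1 * rng.standard_normal(3)
print(adam(noisy_grad, np.zeros(3), num_steps=5000))   # ends up close to target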
In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains
our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam’s
convergence in online convex programming. Empirically, our method consistently outperforms other
methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is
a versatile algorithm that scales to large-scale high-dimensional machine learning problems.
2 ALGORITHM
See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let f(θ) be a noisy objec-
tive function: a stochastic scalar function that is differentiable w.r.t. parameters θ. We are in-
terested in minimizing the expected value of this function, E[f(θ)], w.r.t. its parameters θ. With f1(θ), ..., fT(θ) we denote the realisations of the stochastic function at subsequent timesteps 1, ..., T. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With gt = ∇θ ft(θ) we denote the gradient, i.e. the vector of partial derivatives of ft w.r.t. θ, evaluated at timestep t.
The algorithm updates exponential moving averages of the gradient (mt) and the squared gradient
(vt) where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving
averages. The moving averages themselves are estimates of the 1st moment (the mean) and the
2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are
initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially
during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1).
The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected
estimates m̂t and v̂t. See section 3 for more details.
Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines: αt = α · √(1 − β2^t)/(1 − β1^t) and θt ← θt−1 − αt · mt/(√vt + ε̂).
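The equivalence of the two orderings is easy to verify numerically; the sketch below (with illustrative variable names) assumes ε̂ is taken to be ε · √(1 − β2^t), which makes the two forms agree exactly.

import numpy as np

rng = np.random.default_rng(1)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, t = rng.standard_normal(5), rng.random(5), 7    # arbitrary moment vectors

# Original form: bias-correct the moments, then take the step.
m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
step_a = alpha * m_hat / (np.sqrt(v_hat) + eps)

# Reordered form: fold both corrections into the stepsize.
alpha_t = alpha * np.sqrt(1 - beta2**t) / (1 - beta1**t)
eps_hat = eps * np.sqrt(1 - beta2**t)                 # epsilon rescaled accordingly
step_b = alpha_t * m / (np.sqrt(v) + eps_hat)

print(np.max(np.abs(step_a - step_b)))                # ~0: the two forms agree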
2.1 ADAM’S UPDATE RULE
An important property of Adam’s update rule is its careful choice of stepsizes. Assuming ε = 0, the effective step taken in parameter space at timestep t is ∆t = α · m̂t/√v̂t. The effective stepsize has two upper bounds: |∆t| ≤ α · (1 − β1)/√(1 − β2) in the case (1 − β1) > √(1 − β2), and |∆t| ≤ α
otherwise. The first case only happens in the most severe case of sparsity: when a gradient has
been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize
will be smaller. When (1 − β1) = √(1 − β2) we have that |m̂t/√v̂t| < 1, therefore |∆t| < α. In more common scenarios, we will have that m̂t/√v̂t ≈ ±1 since |E[g]/√E[g²]| ≤ 1. The effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting α, i.e., |∆t| ⪅ α. This can be understood as establishing a trust region around
the current parameter value, beyond which the current gradient estimate does not provide sufficient
information. This typically makes it relatively easy to know the right scale of α in advance. For
many machine learning models, for instance, we often know in advance that good optima are with
high probability within some set region in parameter space; it is not uncommon, for example, to
have a prior distribution over the parameters. Since α sets (an upper bound of) the magnitude of
steps in parameter space, we can often deduce the right order of magnitude of α such that optima
can be reached from θ0 within some number of iterations. With a slight abuse of terminology,
we will call the ratio m̂t/√v̂t the signal-to-noise ratio (SNR). With a smaller SNR the effective
stepsize ∆t will be closer to zero. This is a desirable property, since a smaller SNR means that
there is greater uncertainty about whether the direction of m̂t corresponds to the direction of the true
gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading
to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize
∆t is also invariant to the scale of the gradients; rescaling the gradients g with factor c will scale m̂t with a factor c and v̂t with a factor c², which cancel out: (c · m̂t)/(√(c² · v̂t)) = m̂t/√v̂t.
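Both properties, the invariance of ∆t to gradient rescaling and the rough bound |∆t| ⪅ α, can be checked numerically. A small illustrative NumPy sketch on an i.i.d. gradient stream (not an experiment from the paper):

import numpy as np

rng = np.random.default_rng(2)
alpha, beta1, beta2 = 0.001, 0.9, 0.999
g = rng.standard_normal(200)                  # a stream of scalar gradients

def effective_steps(grads):
    """Delta_t = alpha * m_hat_t / sqrt(v_hat_t) for each timestep (eps = 0)."""
    m = v = 0.0
    steps = []
    for t, g_t in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g_t
        v = beta2 * v + (1 - beta2) * g_t**2
        steps.append(alpha * (m / (1 - beta1**t)) / np.sqrt(v / (1 - beta2**t)))
    return np.array(steps)

d1, d2 = effective_steps(g), effective_steps(1000.0 * g)
print(np.max(np.abs(d1 - d2)))     # ~0: rescaling the gradients leaves Delta_t unchanged
print(np.max(np.abs(d1)) / alpha)  # about 1: the largest effective step is on the order of alpha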
3 INITIALIZATION BIAS CORRECTION
As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive
the term for the second moment estimate; the derivation for the first moment estimate is completely
analogous. Let g be the gradient of the stochastic objective f , and we wish to estimate its second
raw moment (uncentered variance) using an exponential moving average of the squared gradient,
with decay rate β2. Let g1, ..., gT be the gradients at subsequent timesteps, each a draw from an
underlying gradient distribution gt ∼ p(gt). Let us initialize the exponential moving average as
v0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average, vt = β2 · vt−1 + (1 − β2) · gt² (where gt² indicates the elementwise square gt ⊙ gt), can be written as a function of the gradients at all previous timesteps:

vt = (1 − β2) ∑_{i=1}^t β2^(t−i) · gi²    (1)
We wish to know how E[vt], the expected value of the exponential moving average at timestep t, relates to the true second moment E[gt²], so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):

E[vt] = E[(1 − β2) ∑_{i=1}^t β2^(t−i) · gi²]    (2)
      = E[gt²] · (1 − β2) ∑_{i=1}^t β2^(t−i) + ζ    (3)
      = E[gt²] · (1 − β2^t) + ζ    (4)
where ζ = 0 if the true second moment E[gi²] is stationary; otherwise ζ can be kept small, since the exponential decay rate β2 can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term (1 − β2^t), which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.
In case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing β2 close to 1 (a small value of 1 − β2); however, it is exactly this case of β2 close to 1 where a lack of initialization bias correction would lead to initial steps that are much larger.
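The size of this bias, and the effect of the correction, can be illustrated with a short Monte-Carlo sketch (illustrative NumPy code, not from the paper):

import numpy as np

# With v_0 = 0 and stationary gradients, eq. (4) predicts E[v_t] = (1 - beta2**t) * E[g^2];
# dividing by (1 - beta2**t) removes the initialization bias.
rng = np.random.default_rng(3)
beta2, t_max, runs = 0.999, 50, 20000
true_second_moment = 0.5**2 + 2.0**2            # E[g^2] for g ~ N(0.5, 2.0^2)

g = rng.normal(0.5, 2.0, size=(runs, t_max))    # i.i.d. gradient draws
v = np.zeros(runs)
for t in range(1, t_max + 1):
    v = beta2 * v + (1 - beta2) * g[:, t - 1]**2

print(v.mean())                                 # biased: about (1 - beta2**50) * 4.25
print((1 - beta2**t_max) * true_second_moment)  # the predicted biased value
print(v.mean() / (1 - beta2**t_max))            # bias-corrected: about 4.25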
4 CONVERGENCE ANALYSIS
We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich, 2003), given an arbitrary, unknown sequence of convex cost functions f1(θ), f2(θ), ..., fT(θ). At each time t, our goal is to predict the parameter θt and evaluate it on a previously unknown cost function ft. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the differences between the online prediction ft(θt) and the best fixed point parameter ft(θ∗) from a feasible set X. Concretely, the regret is defined as:
R(T) = ∑_{t=1}^T [ft(θt) − ft(θ∗)]    (5)
where θ∗ = argmin_{θ∈X} ∑_{t=1}^T ft(θ). We show Adam has an O(√T) regret bound; a proof is given in the appendix. Our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation, where gt ≜ ∇ft(θt) and gt,i is its ith element. We define g1:t,i ∈ R^t as the vector that contains the ith dimension of the gradients over all iterations up to t, g1:t,i = [g1,i, g2,i, ..., gt,i]. Also, we define γ ≜ β1²/√β2. The following theorem holds when the learning rate αt decays at a rate of t^(−1/2) and the first moment running average coefficient β1,t decays exponentially with λ, which is typically close to 1, e.g. 1 − 10^−8.
Theorem 4.1. Assume that the function ft has bounded gradients, ‖∇ft(θ)‖2 ≤ G, ‖∇ft(θ)‖∞ ≤ G∞ for all θ ∈ R^d, and distance between any θt generated by Adam is bounded, ‖θn − θm‖2 ≤ D, ‖θm − θn‖∞ ≤ D∞ for any m, n ∈ {1, ..., T}, and β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let αt = α/√t and β1,t = β1 λ^(t−1), λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:
R(T) ≤ D²/(2α(1 − β1)) ∑_{i=1}^d √(T · v̂T,i) + α(1 + β1)G∞/((1 − β1)√(1 − β2)(1 − γ)²) ∑_{i=1}^d ‖g1:T,i‖2 + ∑_{i=1}^d D∞² G∞ √(1 − β2)/(2α(1 − β1)(1 − λ)²)
Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the summation terms can be much smaller than their upper bounds, ∑_{i=1}^d ‖g1:T,i‖2 << dG∞√T and ∑_{i=1}^d √(T · v̂T,i) << dG∞√T, in particular if the class of functions and data features are of the form of section 1.2 in (Duchi et al., 2011). Their results for the expected value E[∑_{i=1}^d ‖g1:T,i‖2] also apply to Adam. In particular, adaptive methods such as Adam and Adagrad can achieve O(log d · √T), an improvement over the O(√(dT)) of non-adaptive methods. Decaying β1,t towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g. (Sutskever et al., 2013) suggest that reducing the momentum coefficient at the end of training can improve convergence.
Finally, we can show that the average regret of Adam converges:
Corollary 4.2. Assume that the function ft has bounded gradients, ‖∇ft(θ)‖2 ≤ G, ‖∇ft(θ)‖∞ ≤ G∞ for all θ ∈ R^d, and distance between any θt generated by Adam is bounded, ‖θn − θm‖2 ≤ D, ‖θm − θn‖∞ ≤ D∞ for any m, n ∈ {1, ..., T}. Adam achieves the following guarantee, for all T ≥ 1:

R(T)/T = O(1/√T)
This result can be obtained by using Theorem 4.1 and ∑_{i=1}^d ‖g1:T,i‖2 ≤ dG∞√T. Thus, lim_{T→∞} R(T)/T = 0.
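As a purely empirical, illustrative sanity check of Corollary 4.2 (not a substitute for the proof), one can run Adam with αt = α/√t and β1,t = β1 · λ^(t−1) on a toy 1-D online quadratic problem and watch the average regret shrink; the setup below is an assumption chosen for demonstration.

import numpy as np

rng = np.random.default_rng(4)
alpha, beta1, beta2, eps, lam = 0.1, 0.9, 0.999, 1e-8, 1 - 1e-8

def average_regret(T):
    """R(T)/T for f_t(theta) = (theta - c_t)^2 with c_t ~ Uniform[-1, 1]."""
    c = rng.uniform(-1.0, 1.0, size=T)
    theta_star = c.mean()                       # minimizer of sum_t f_t(theta)
    theta, m, v = 0.0, 0.0, 0.0
    losses = np.empty(T)
    for t in range(1, T + 1):
        losses[t - 1] = (theta - c[t - 1])**2   # pay f_t at the predicted theta_t
        g = 2.0 * (theta - c[t - 1])
        b1t = beta1 * lam**(t - 1)              # decayed first-moment coefficient
        m = b1t * m + (1 - b1t) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)              # bias correction kept in its simple form
        v_hat = v / (1 - beta2**t)
        theta -= (alpha / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)
    return (losses.sum() - ((theta_star - c)**2).sum()) / T

for T in (100, 1000, 10000):
    print(T, np.mean([average_regret(T) for _ in range(20)]))  # shrinks toward 0 as T grows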
5 RELATED WORK
Optimization methods bearing a direct relation to Adam are RMSProp (Tieleman & Hinton, 2012;
Graves, 2013) and AdaGrad (Duchi et al., 2011); these relationships are discussed below. Other
stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the
natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature
from first-order information. The Sum-of-Functions Optimizer (SFO) (Sohl-Dickstein et al., 2014)
is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear
in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained
systems such as a GPU. Like natural gradient descent (NGD) (Amari, 1998), Adam employs a
preconditioner that adapts to the geometry of the data, since v̂t is an approximation to the diagonal
of the Fisher information matrix (Pascanu & Bengio, 2013); however, Adam’s preconditioner (like
AdaGrad’s) is more conservative in its adaptation than vanilla NGD by preconditioning with the square
root of the inverse of the diagonal Fisher information matrix approximation.
RMSProp: An optimization method closely related to Adam is RMSProp (Tieleman & Hinton,
2012). A version with momentum has sometimes been used (Graves, 2013). There are a few impor-
tant differences between RMSProp with momentum and Adam: RMSProp with momentum gener-
ates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are
directly estimated using a running average of the first and second moments of the gradient. RMSProp
also lacks a bias-correction term; this matters most in case of a value of β2 close to 1 (required in
case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and
often divergence, as we also empirically demonstrate in section 6.4.
AdaGrad: An algorithm that works well for sparse gradients is AdaGrad (Duchi et al., 2011). Its
basic version updates parameters as θt+1 = θt − α · gt/√(∑_{i=1}^t gi²). Note that if we choose β2 to be infinitesimally close to 1 from below, then lim_{β2→1} v̂t = t^−1 · ∑_{i=1}^t gi². AdaGrad corresponds to a version of Adam with β1 = 0, infinitesimal (1 − β2) and a replacement of α by an annealed version αt = α · t^(−1/2), namely θt − α · t^(−1/2) · m̂t/√(lim_{β2→1} v̂t) = θt − α · t^(−1/2) · gt/√(t^−1 · ∑_{i=1}^t gi²) = θt − α · gt/√(∑_{i=1}^t gi²). Note that this direct correspondence between Adam and Adagrad does not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a β2 infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.
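This limiting correspondence can be checked numerically; in the illustrative sketch below, β2 = 1 − 10^−8 stands in for the β2 → 1 limit and ε is dropped, as in the derivation above.

import numpy as np

rng = np.random.default_rng(5)
alpha, beta2 = 0.01, 1 - 1e-8
g = rng.standard_normal(20)                     # a short gradient sequence, beta1 = 0

v, max_diff = 0.0, 0.0
for t in range(1, len(g) + 1):
    v = beta2 * v + (1 - beta2) * g[t - 1]**2
    v_hat = v / (1 - beta2**t)                  # bias-corrected second moment
    adam_step = (alpha / np.sqrt(t)) * g[t - 1] / np.sqrt(v_hat)
    adagrad_step = alpha * g[t - 1] / np.sqrt(np.sum(g[:t]**2))
    max_diff = max(max_diff, abs(adam_step - adagrad_step))
print(max_diff)                                 # tiny: the two updates coincide in the limit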
6 EXPERIMENTS
To empirically evaluate the proposed method, we investigated different popular machine learning
models, including logistic regression, multilayer fully connected neural networks and deep convolu-
tional neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve
practical deep learning problems.
We use the same parameter initialization when comparing different optimization algorithms. The
hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the
results are reported using the best hyper-parameter setting.
6.1 EXPERIMENT: LOGISTIC REGRESSION
We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST
dataset. Logistic regression has a well-studied convex objective, making it suitable for comparison
of different optimizers without worrying about local minimum issues. The stepsize α in our logistic regression experiments is adjusted with a 1/√t decay, namely αt = α/√t, which matches the theoretical prediction from section 4. The logistic regression classifies the class label directly from the 784-dimensional image vectors. We compare Adam to accelerated SGD with Nesterov momentum and Adagrad using a minibatch size of 128. According to Figure 1, we found that Adam yields similar convergence to SGD with momentum, and both converge faster than Adagrad.
As discussed in (Duchi et al., 2011), Adagrad can efficiently deal with sparse features and gradients, one of its main theoretical results, whereas SGD is slow at learning rare features. Adam with a 1/√t decay on its stepsize should theoretically match the performance of Adagrad. We examine the sparse feature problem using the IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors over the 10,000 most frequent words. The 10,000-dimensional BoW feature vector for each review is highly sparse. As suggested in (Wang & Manning, 2013), 50% dropout noise can be applied to the BoW features during training to prevent over-fitting.
[Figure 1 plots: training cost vs. iterations over the entire dataset, for MNIST Logistic Regression (AdaGrad, SGDNesterov, Adam) and IMDB BoW feature Logistic Regression (Adagrad+dropout, RMSProp+dropout, SGDNesterov+dropout, Adam+dropout).]
Figure 1: Logistic regression training negative log likelihood on MNIST images and IMDB movie
reviews with 10,000 bag-of-words (BoW) feature vectors.
In Figure 1, Adagrad outperforms SGD with Nesterov momentum by a large margin, both with and without dropout noise. Adam converges as fast as Adagrad. The empirical performance of Adam is consistent with our theoretical findings in sections 2 and 4. Similar to Adagrad, Adam can take advantage of sparse features and obtain a faster convergence rate than normal SGD with momentum.
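As an illustration of this kind of setup (not the authors' experiment code), a PyTorch sketch; random tensors stand in for the MNIST data, and the optimizer's weight_decay stands in for the L2 regularizer.

import math
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in data: 512 random 784-dimensional "images" with 10 class labels.
X = torch.randn(512, 784)
y = torch.randint(0, 10, (512,))

model = nn.Linear(784, 10)                      # multi-class logistic regression
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8,
                             weight_decay=1e-4)  # L2 regularization
# 1/sqrt(t) stepsize decay, alpha_t = alpha / sqrt(t), applied once per epoch.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / math.sqrt(epoch + 1))

for epoch in range(10):
    for i in range(0, len(X), 128):             # minibatches of 128
        optimizer.zero_grad()
        loss = loss_fn(model(X[i:i + 128]), y[i:i + 128])
        loss.backward()
        optimizer.step()
    scheduler.step()
    print(epoch, loss.item())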
6.2 EXPERIMENT: MULTI-LAYER NEURAL NETWORKS
Multi-layer neural networks are powerful models with non-convex objective functions. Although our convergence analysis does not apply to non-convex problems, we empirically found that Adam often outperforms other methods in such cases. In our experiments, we made model choices that are consistent with previous publications in the area; a neural network with two fully connected hidden layers of 1000 hidden units each and ReLU activations is used for this experiment, with a minibatch size of 128.
First, we study different optimizers using the standard deterministic cross-entropy objective func-
tion with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO)
method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with
minibatches of data and has shown good performance on optimization of multi-layer neural net-
works. We used their implementation and compared with Adam to train such models. Figure 2
shows that Adam makes faster progress in terms of both the number of iterations and wall-clock
time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared to Adam, and has a memory requirement that is linear in the number of minibatches.
Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and are often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed
failed to converge on cost functions with stochastic regularization. We compare the effectiveness of
Adam to other stochastic first order methods on multi-layer neural networks trained with dropout
noise. Figure 2 shows our results; Adam shows better convergence than other methods.
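A hedged PyTorch sketch of the dropout network described above; random tensors stand in for MNIST, and details such as the dropout rate are illustrative assumptions rather than the paper's exact settings.

import torch
from torch import nn

torch.manual_seed(0)

# Two fully connected hidden layers of 1000 ReLU units, with dropout noise.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(1024, 1, 28, 28)                # stand-in for MNIST images
y = torch.randint(0, 10, (1024,))

model.train()                                   # keep dropout active during training
for epoch in range(5):
    for i in range(0, len(X), 128):             # minibatch size 128
        optimizer.zero_grad()
        loss = loss_fn(model(X[i:i + 128]), y[i:i + 128])
        loss.backward()
        optimizer.step()
    print(epoch, loss.item())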
6.3 EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear
units have shown considerable success in computer vision tasks. Unlike most fully connected neural
nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller
learning rate for the convolution layers is often used in practice when applying SGD. We show the
effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The input images are pre-processed by whitening, and dropout noise is applied to the input layer and to the fully connected layer. The minibatch size is also set to 128, similar to the previous experiments.
[Figure 2 plots: training cost vs. iterations over the entire dataset for an MNIST multilayer neural network with dropout, with curves for AdaGrad, RMSProp, SGDNesterov, AdaDelta and Adam; panels (a) and (b).]
Figure 2: Training of multilayer neural networks on MNIST images. (a) Neural networks using
dropout stochastic regularization. (b) Neural networks with deterministic cost function. We compare
with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014)
[Figure 3 plots: training cost vs. iterations over the entire dataset for the CIFAR10 ConvNet, over the first 3 epochs (left) and over 45 epochs (right), with curves for AdaGrad, AdaGrad+dropout, SGDNesterov, SGDNesterov+dropout, Adam and Adam+dropout.]
Figure 3: Convolutional neural networks training cost. (left) Training cost for the first three epochs.
(right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.
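A PyTorch sketch of a CNN in this spirit; the padding, channel sizes and dropout placement below are illustrative assumptions loosely following the c64-c64-c128-1000 description in Figure 3, not the paper's exact architecture.

import torch
from torch import nn

torch.manual_seed(0)

# Three stages of 5x5 convolutions and 3x3/stride-2 max pooling, then a
# 1000-unit ReLU layer. Spatial sizes for 32x32 CIFAR-10 inputs are noted.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),      # 32x32 -> 15x15
    nn.Conv2d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),      # 15x15 -> 7x7
    nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),      # 7x7 -> 3x3
    nn.Flatten(),
    nn.Dropout(0.5),                            # dropout before the dense layer
    nn.Linear(128 * 3 * 3, 1000), nn.ReLU(),
    nn.Linear(1000, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 3, 32, 32)                 # stand-in for whitened CIFAR-10
y = torch.randint(0, 10, (256,))
for i in range(0, len(X), 128):                 # minibatch size 128
    optimizer.zero_grad()
    loss = loss_fn(model(X[i:i + 128]), y[i:i + 128])
    loss.backward()
    optimizer.step()
    print(loss.item())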
Interestingly, although both Adam and Adagrad make rapid progress lowering the cost in the initial stage of training, as shown in Figure 3 (left), Adam and SGD eventually converge considerably faster than Adagrad for CNNs, as shown in Figure 3 (right). We notice that the second moment estimate v̂t vanishes to zero after a few epochs and is dominated by the ε in algorithm 1. The second moment estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared to the fully connected network from Section 6.2. Reducing the minibatch variance through the first moment is more important in CNNs and contributes to the speed-up. As a result, Adagrad converges much slower than the others in this particular experiment. Though Adam shows only a marginal improvement over SGD with momentum, it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked manually, as in SGD.
[Figure 4 plots: loss (y-axes) vs. log10(α) (x-axes) for β1 ∈ {0, 0.9} and β2 ∈ {0.99, 0.999, 0.9999}, (a) after 10 epochs and (b) after 100 epochs.]
Figure 4: Effect of bias-correction terms (red line) versus no bias correction terms (green line)
after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-
Encoder (VAE) (Kingma & Welling, 2013), for different settings of stepsize α (x-axes) and hyper-
parameters β1 and β2.
6.4 EXPERIMENT: BIAS-CORRECTION TERM
We also empirically evaluate the effect of the bias correction terms explained in sections 2 and 3.
As discussed in section 5, removal of the bias correction terms results in a version of RMSProp (Tieleman & Hinton, 2012) with momentum. We vary β1 and β2 when training a variational auto-encoder (VAE) with the same architecture as in (Kingma & Welling, 2013): a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β1 ∈ [0, 0.9], β2 ∈ {0.99, 0.999, 0.9999}, and log10(α) ∈ [−5, ..., −1]. Values of β2 close to 1, required for robustness to sparse gradients, result in larger initialization bias; therefore we expect the bias correction term to be important in such cases of slow decay, preventing an adverse effect on optimization.
In Figure 4, values of β2 close to 1 indeed lead to instabilities in training when no bias correction term was present, especially in the first few epochs of training. The best results were achieved with small values of (1 − β2) and bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam performed equal or better than RMSProp, regardless of hyper-parameter setting.
7 EXTENSIONS
7.1 ADAMAX
In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L2 norm of their individual current and past gradients. We can generalize the L2 norm based update rule to an Lp norm based update rule. Such variants become numerically unstable for large p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm emerges; see algorithm 2. We’ll now derive the algorithm. Let, in case of the Lp norm, the stepsize at time t be inversely proportional to vt^(1/p), where:

vt = β2^p · vt−1 + (1 − β2^p) · |gt|^p    (6)
   = (1 − β2^p) ∑_{i=1}^t β2^(p(t−i)) · |gi|^p    (7)
Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details. Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. With β1^t we denote β1 to the power t. Here, (α/(1 − β1^t)) is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.
Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  u0 ← 0 (Initialize the exponentially weighted infinity norm)
  t ← 0 (Initialize timestep)
  while θt not converged do
    t ← t + 1
    gt ← ∇θ ft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)
    mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate)
    ut ← max(β2 · ut−1, |gt|) (Update the exponentially weighted infinity norm)
    θt ← θt−1 − (α/(1 − β1^t)) · mt/ut (Update parameters)
  end while
  return θt (Resulting parameters)
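A minimal NumPy sketch of Algorithm 2 (illustrative code, reusing the toy objective from the Adam sketch in section 2):

import numpy as np

def adamax(grad_fn, theta0, alpha=0.002, beta1=0.9, beta2=0.999, num_steps=5000):
    """Minimal sketch of Algorithm 2 (AdaMax)."""
    theta = theta0.astype(float)
    m = np.zeros_like(theta)                    # 1st moment vector
    u = np.zeros_like(theta)                    # exponentially weighted infinity norm
    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))
        theta = theta - (alpha / (1 - beta1**t)) * m / u
    return theta

# Toy usage: noisy gradients of f(theta) = 0.5 * ||theta - target||^2.
rng = np.random.default_rng(6)
target = np.array([1.0, -2.0, 3.0])
noisy_grad = lambda theta, t: (theta - target) + 0.1 * rng.standard_normal(3)
print(adamax(noisy_grad, np.zeros(3)))          # ends up close to target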
Note that the decay term is here equivalently parameterised as β2^p instead of β2. Now let p → ∞, and define ut = lim_{p→∞} (vt)^(1/p), then:

ut = lim_{p→∞} (vt)^(1/p) = lim_{p→∞} ((1 − β2^p) ∑_{i=1}^t β2^(p(t−i)) · |gi|^p)^(1/p)    (8)
   = lim_{p→∞} (1 − β2^p)^(1/p) (∑_{i=1}^t β2^(p(t−i)) · |gi|^p)^(1/p)    (9)
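The limit can also be checked numerically. The illustrative sketch below compares the Lp moving average for a large p with its max form and with the recursion ut = max(β2 · ut−1, |gt|) of Algorithm 2 (the factor (1 − β2^p)^(1/p) tends to 1 and is dropped):

import numpy as np

rng = np.random.default_rng(7)
beta2, p = 0.999, 1000
g = np.abs(rng.standard_normal(50)) + 0.1       # |g_1|, ..., |g_t|
t = len(g)
weights = beta2 ** (t - np.arange(1, t + 1))    # beta2^(t - i) for i = 1..t

# (sum_i beta2^(p(t-i)) |g_i|^p)^(1/p), computed in a scale-stable way.
max_value = np.max(weights * g)                 # the p -> infinity limit
lp_value = max_value * np.sum(((weights * g) / max_value) ** p) ** (1.0 / p)

u = 0.0
for g_i in g:                                   # recursion from Algorithm 2
    u = max(beta2 * u, g_i)

print(lp_value, max_value, u)                   # all three nearly equal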
(9)