
What I will test next #1

Open
ducha-aiki opened this issue Dec 30, 2015 · 33 comments

Comments

@ducha-aiki
Owner

  • Continue the random walk on ResNets - to understand how to train them properly. There is definitely a problem somewhere that I cannot see :(
  • Pooling: AVG-pooling caffenet; "Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree" http://arxiv.org/abs/1509.08985 -- all three variants from the paper, thanks to the authors for the code.
  • Regularization: weight decay values, L1/L2 weight decay, dropout rates.
  • Freeze the conv structure and play with the fc6-fc8 classifier. Maxout? More layers? Convolution? Inspired by http://arxiv.org/abs/1504.06066, but in an end-to-end style.
  • Solvers: default caffenet + ADAM / RMSProp / Nesterov / "poly" policy.
  • BatchNorm for blocks of layers, not each layer.
  • For fully convolutional nets, which is better: avg pool on features, then the classifier, or the other way round?
  • SqueezeNet https://github.com/DeepScale/SqueezeNet
  • (far future) How do the best choices stack? I.e. BN + 20% dropout + best activation + best solver + ...

Suggestions and training logs from the community are welcome.

@ghost

ghost commented Jan 1, 2016

Is this worth testing?

"Deep Learning with S-shaped Rectified Linear Activation Units"

http://arxiv.org/abs/1512.07030

@ducha-aiki
Owner Author

Well, I don't like activations that cannot be done "in-place", but if SReLU gets implemented in Caffe, I will test it.

@Darwin2011

@ducha-aiki

I have learned a lot from your caffenet-benchmark. Is there anything I can try to speed up the benchmarks?
Thanks

@ducha-aiki
Owner Author

@Darwin2011
Thank you!
I can see two ways you can help:
1) Do your own benchmark and make a PR, like @dereyly did.
2) In caffenet-style networks the real bottleneck, as I recently found, is not the GPU but the input data layer (BVLC/caffe#2252), which does JPEG decompression AND real-time rescaling. A multicore implementation of it would help a lot.
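
A multicore version is easy to prototype in Python; a minimal sketch (assuming OpenCV's Python bindings and a list of JPEG paths - a hypothetical stand-in for the C++ data layer, not its actual code):

# Minimal sketch: decode + rescale JPEGs on several cores, i.e. the work
# the ImageData layer currently does single-threaded.
import multiprocessing as mp

import cv2  # assumes OpenCV is installed


def decode_and_rescale(path, size=(128, 128)):
    img = cv2.imread(path)        # JPEG decompression
    return cv2.resize(img, size)  # real-time rescaling


def load_batch(paths, workers=4):
    with mp.Pool(workers) as pool:
        return pool.map(decode_and_rescale, paths)

# usage: batch = load_batch(['cat.jpg', 'dog.jpg'], workers=8)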

@bhack

bhack commented Feb 3, 2016

@ducha-aiki
Owner Author

@bhack thanks, as usual ;)

@bhack

bhack commented Feb 5, 2016

@ducha-aiki
Owner Author

@bhack yes, I have contacted that guy :) Now testing ResNet101 (as in the paper, + without scale/bias, + without BN) without the last block, because the activation size is too small for it.

@1adrianb

@ducha-aiki, did you have any luck with training ResNet on Caffe? Thanks

@ducha-aiki
Owner Author

@1adrianb there is even one successful attempt here: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/ResNets.md
But in most of my trials they overfit a lot, which probably means that 128 px is not enough for them.
And I think that after the paper "Identity Mappings in Deep Residual Networks" http://arxiv.org/abs/1603.05027 there is literally nothing left to test in the ResNet setup that the authors haven't checked. So I refer you to that excellent paper.

@1adrianb

@ducha-aiki thanks a lot for your prompt answer and for the recommended paper.
I saw in your implementation that you are not using the Scale layer to learn the parameters; is there any reason for not doing so, as in the model posted by Kaiming He?

@ducha-aiki
Owner Author

@1adrianb
I use it... sometimes. Here are tests of batchnorm; some of them use scale/bias, while others do not. The reason: test them all :) https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md
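
For reference, the two variants differ only in whether a learned scale (gamma) and bias (beta) follow the BatchNorm layer; in pycaffe's NetSpec that is roughly (a sketch with made-up layer names, not one of the benchmark prototxts):

# Sketch: BatchNorm alone vs. BatchNorm followed by a learned scale/bias.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[64, 3, 128, 128]))
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3)
n.bn1 = L.BatchNorm(n.conv1, in_place=True)               # variant 1 stops here
n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)  # variant 2 adds gamma/beta
n.relu1 = L.ReLU(n.scale1, in_place=True)
print(n.to_proto())  # emits the corresponding prototxt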

@1adrianb

@ducha-aiki makes sense :) Thanks again for your help!

@ducha-aiki
Owner Author

Now VGG16 with all the tricks is in training (to check whether they work for weaker models only).

@wangxianliang

wangxianliang commented May 26, 2016

Hi, have you tried the pre-activation ResNet described in the paper http://arxiv.org/abs/1603.05027?

And what about Inception-ResNet-v2, described in the paper http://arxiv.org/abs/1602.07261?

@ducha-aiki
Owner Author

@wangxianliang
Well, no, and I am not planning to in the near future. Reasons:
1) This benchmark is originally a test of "we propose a cool thing and test it on CIFAR-100" papers, or an evaluation of missing design choices. Both papers you mention have already made a good evaluation on full ImageNet.
2) They are both very time-consuming. Yes, I have tested VGGNet and GoogLeNet for reference, but mostly cheaper networks.
3) They have a complex structure... so if you could write a prototxt, it would greatly increase the chances that I will run them in the meantime :) I would be grateful for help.

@ibmua
Contributor

ibmua commented Jun 2, 2016

It's a good idea to test the influence of LSUV-init batch size on large networks with highly variant data. It seems from the paper that you only tested this for some tiny tanh CIFAR-10 net with relatively tiny batches, and even then a positive correlation between batch size and performance could be seen. It seems obvious that the influence is much greater for large networks and highly variable data with, say, thousands of classification outputs. For example, I'm currently training a net to classify 33.5k mouse-painted symbols. That CIFAR-10 data tells me nothing.

Testing 6 different batches on my data:
poolconv_2 variance = 0.0515101 mean = 0.206088
poolconv_2 variance = 0.0517261 mean = 0.206058
poolconv_2 variance = 0.0521883 mean = 0.206072
poolconv_2 variance = 0.989368 mean = 0.22629
poolconv_2 variance = 0.995676 mean = 0.226181
poolconv_2 variance = 0.998411 mean = 0.22653

One of the reasons why testing batch influence is so important is that if it really improves things, we can switch to computing the LSUV variance not over one mini-batch but over several iterations, and use, say, a 100k batch rather than 1000.
Also, people don't really use tanh much nowadays, so personally I don't get why you used that one for the batch-size benchmark. It tells us almost nothing, really.
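
For anyone who wants to reproduce such numbers: the statistics above can be read out with a few lines of pycaffe; a sketch, with the net path and blob name as placeholders:

# Sketch: activation mean/variance of one blob over several mini-batches.
import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # data layer supplies batches

for _ in range(6):
    net.forward()  # each call draws the next mini-batch
    act = net.blobs['poolconv_2'].data
    print('poolconv_2 variance = %.6f mean = %.6f' % (act.var(), act.mean()))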

@ducha-aiki
Owner Author

I used tanh because LSUV worked worst for it. My experience with ImageNet confirms that any batch size > 32 is OK for LSUV, if the data is shuffled. Your data looks imbalanced or not shuffled.
If you do such a test on your dataset, I will definitely add it to this benchmark :)

@ibmua
Contributor

ibmua commented Jun 2, 2016

Yes, it actually is not very shuffled, which means that I have to use a larger batch here to get something like what I would get with a smaller one if the data were shuffled.

The problem with benchmarking is that I don't have the required stuff set up on my computer, and it would take me several hours to do that.

@ibmua
Contributor

ibmua commented Jul 5, 2016

It would be great if you noted the time it took for the different things to converge. Both epochs and actual time.

@ducha-aiki
Owner Author

@ibmua for epochs it is very easy - 320K everywhere :) As for times, it is hard (I am lazy), because some of the trainings consist of lots of save-load cycles with pauses between them. But I am going to do this... once :)

@ibmua
Contributor

ibmua commented Jul 5, 2016

Yeah, I guess that would just take

import time
start = time.time()
# ... training runs here ...
end = time.time()
elapsed = end - start  # seconds

, logging the elapsed time to a separate file during saves and reading it back during loads. Not too hard. It would give a better sense of the difference between methods. For example, training with momentum is great, but it takes more time.

@ducha-aiki
Owner Author

@ibmua

import time
start = time.time()
end = time.time()

Do you really think I do it in Python? You are underestimating my laziness ;) Everything I do is editing *.prototxt and then running caffe train --solver=current_net.prototxt > logfile.log
Then I call the https://github.com/BVLC/caffe/blob/master/tools/extra/plot_training_log.py.example script with default keys to get the graphs :)
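
If wall-clock numbers were ever wanted without touching the training setup, the glog timestamps already written to logfile.log could be parsed after the fact; a sketch, assuming the default "I0705 12:34:56.789012 ..." log prefix and ignoring day wrap-around and pauses:

# Sketch: estimate elapsed training time from Caffe's glog timestamps.
import re

STAMP = re.compile(r'^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d+)')

def seconds(line):
    m = STAMP.match(line)
    if m is None:
        return None
    h, mn, s, us = m.groups()
    return int(h) * 3600 + int(mn) * 60 + int(s) + int(us) / 1e6

with open('logfile.log') as f:
    times = [t for t in map(seconds, f) if t is not None]
print('elapsed: %.0f s (same-day lines only)' % (times[-1] - times[0]))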

@ibmua
Contributor

ibmua commented Jul 5, 2016

Wow, okay. I guess that would just take editing plot_training_log.py.example then.

@ibmua
Contributor

ibmua commented Jul 5, 2016

Actually, easier: just measure a single forward-backward pass on the side and then multiply by the number of iterations from the log. Having info on whether things actually converged or you just cut them off because they would take >320k iterations would help too. Also, graphs would be a pleasant addition.
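
Caffe even ships a benchmarking mode for exactly this (caffe time -model train_val.prototxt -iterations 50), and the same measurement takes a few lines of pycaffe; a sketch, with the net path as a placeholder:

# Sketch: time one forward-backward pass and extrapolate to the full run.
import time

import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)
net.forward()
net.backward()  # warm-up pass, excluded from timing

runs = 50
start = time.time()
for _ in range(runs):
    net.forward()
    net.backward()
per_iter = (time.time() - start) / runs
print('estimated total: %.1f h' % (per_iter * 320000 / 3600.0))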

@ibmua
Contributor

ibmua commented Jul 29, 2016

It's interesting to find out how ResNets perform with different activations. PReLU was co-authored by Kaiming He, a co-author of ResNets, but for some reason he used plain ReLUs in ResNets.

@ducha-aiki
Owner Author

CReLU is on the way https://arxiv.org/pdf/1603.05201.pdf

@wangg12

wangg12 commented Jun 19, 2017

How about SELU, @ducha-aiki?

@ducha-aiki
Owner Author

@simbaforrest

@ducha-aiki In the SELU paper they seem to suggest MSRA-like initialization for the weights, while your prototxts seem to use fixed Gaussian initialization. Could this affect the result?

@ducha-aiki
Owner Author

@simbaforrest I don't use fixed initialization; I use LSUV init https://arxiv.org/abs/1511.06422, which is stated on each page of this repo :)
What is in the prototxt does not matter, because LSUV is run as a separate script. LSUV uses data to adjust the weight values, so if MSRA is optimal, LSUV will output something similar to MSRA.
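
The core of LSUV is just that data-driven loop; a rough sketch of the idea (not the actual script from the LSUV repo), with placeholder layer/blob names and orthonormal pre-initialization assumed:

# Sketch of the LSUV idea: rescale each layer's weights until its output
# blob has unit variance on real data (after orthonormal initialization).
import caffe

net = caffe.Net('train_val.prototxt', caffe.TRAIN)
for layer, blob in [('conv1', 'conv1'), ('conv2', 'conv2')]:  # placeholders
    for _ in range(10):  # a few correction passes per layer
        net.forward()
        var = net.blobs[blob].data.var()
        if abs(var - 1.0) < 0.01:
            break
        net.params[layer][0].data[...] /= var ** 0.5  # rescale weights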

@simbaforrest

simbaforrest commented Aug 21, 2017 via email

@ducha-aiki
Owner Author

Well, it is still possible that SELU is very architecture-dependent.
