
What I will test next #1

Open
ducha-aiki opened this issue Dec 30, 2015 · 33 comments

Comments

@ducha-aiki
Owner

  • Continue the random walk on ResNets - to understand how to train them properly. There is definitely a problem somewhere that I cannot see :(
  • Pooling: AVG-pooling caffenet; "Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree" http://arxiv.org/abs/1509.08985 -- all three variants from the paper, thanks to the authors for the code.
  • Regularization: weight decay values, L1/L2 weight decay, dropout rates.
  • Freeze the conv structure and play with the fc6-fc8 classifier. Maxout? More layers? Convolution? Inspired by http://arxiv.org/abs/1504.06066, but in an end-to-end style.
  • Solvers: default caffenet + ADAM / RMSProp / Nesterov / "poly" policy.
  • BatchNorm for blocks of layers, not each layer.
  • For fully convolutional nets, which is better: avg pool on features, then the classifier, or the other way round?
  • SqueezeNet https://github.com/DeepScale/SqueezeNet
  • (far future) How do the best choices stack? I.e. BN + 20% dropout + best activation + best solver + ...

Suggestions and training logs from the community are welcome.

@ghost

ghost commented Jan 1, 2016

Is this worth testing?

"Deep Learning with S-shaped Rectified Linear Activation Units"

http://arxiv.org/abs/1512.07030

@ducha-aiki
Owner Author

Well, I don't like activations that cannot be done "in-place", but if SReLU gets implemented in Caffe, I will test it.

@Darwin2011

@ducha-aiki

I have learned a lot from your caffenet-benchmark. Is there anything I can try to speed up the benchmarks?
Thanks

@ducha-aiki
Owner Author

@Darwin2011
Thank you!
I can see two ways you can help:
1) Do your own benchmark and make a PR, like @dereyly did.
2) In caffenet-style networks the real bottleneck, as I recently found, is not the GPU but the input data layer (BVLC/caffe#2252), which does JPEG decompression AND real-time rescaling. A multicore implementation of it would help a lot.
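
A multicore version is easy to prototype in Python; a minimal sketch (assuming OpenCV's Python bindings and a list of JPEG paths - a hypothetical stand-in for the C++ data layer, not its actual code):

# Minimal sketch: decode + rescale JPEGs on several cores, i.e. the work
# the ImageData layer currently does single-threaded.
import multiprocessing as mp

import cv2  # assumes OpenCV is installed


def decode_and_rescale(path, size=(128, 128)):
    img = cv2.imread(path)        # JPEG decompression
    return cv2.resize(img, size)  # real-time rescaling


def load_batch(paths, workers=4):
    with mp.Pool(workers) as pool:
        return pool.map(decode_and_rescale, paths)

# usage: batch = load_batch(['cat.jpg', 'dog.jpg'], workers=8)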

@bhack

bhack commented Feb 3, 2016

@ducha-aiki
Owner Author

@bhack thanks, as usual ;)

@bhack

bhack commented Feb 5, 2016

@ducha-aiki
Owner Author

@bhack yes, I have contacted that guy :) Now testing ResNet101 (as in the paper, + without scale/bias, + without BN) without the last block, because the activation size is too small for it.

@1adrianb

@ducha-aiki, did you have any luck with training ResNet on Caffe? Thanks

@ducha-aiki
Owner Author

@1adrianb there is even one successful attempt here: https://github.com/ducha-aiki/caffenet-benchmark/blob/master/ResNets.md
But in most of my trials they overfit a lot, which probably means that 128 px is not enough for them.
And I think that after the paper "Identity Mappings in Deep Residual Networks" http://arxiv.org/abs/1603.05027 there is literally nothing left to test in the ResNet setup that the authors haven't checked. So I refer you to that excellent paper.

@1adrianb

@ducha-aiki thanks a lot for your prompt answer and for the recommended paper.
I saw in your implementation that you are not using the Scale layer to learn the parameters; is there any reason for not doing so, as in the model posted by Kaiming He?

@ducha-aiki
Owner Author

@1adrianb
I use it... sometimes. Here are tests of batchnorm; some of them use scale/bias, while others do not. The reason: test them all :) https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md
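
For reference, the two variants differ only in whether a learned scale (gamma) and bias (beta) follow the BatchNorm layer; in pycaffe's NetSpec that is roughly (a sketch with made-up layer names, not one of the benchmark prototxts):

# Sketch: BatchNorm alone vs. BatchNorm followed by a learned scale/bias.
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[64, 3, 128, 128]))
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3)
n.bn1 = L.BatchNorm(n.conv1, in_place=True)               # variant 1 stops here
n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)  # variant 2 adds gamma/beta
n.relu1 = L.ReLU(n.scale1, in_place=True)
print(n.to_proto())  # emits the corresponding prototxt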

@1adrianb

@ducha-aiki makes sense :) Thanks again for your help!

@ducha-aiki
Owner Author

Now VGG16 with all the tricks is in training (to check whether they work for weaker models only).

@wangxianliang

wangxianliang commented May 26, 2016

Hi, have you tried the pre-activation ResNet described in the paper http://arxiv.org/abs/1603.05027?

And what about Inception-ResNet-v2, described in the paper http://arxiv.org/abs/1602.07261?

@ducha-aiki
Owner Author

@wangxianliang
Well, no, and I am not planning to in the near future. Reasons:
1) This benchmark is originally a test of "we propose a cool thing and test it on CIFAR-100" papers, or an evaluation of missing design choices. Both papers you mention have already made a good evaluation on full ImageNet.
2) They are both very time-consuming. Yes, I have tested VGGNet and GoogLeNet for reference, but mostly cheaper networks.
3) They have a complex structure... so if you could write a prototxt, it would greatly increase the chances that I will run them in the meantime :) I would be grateful for help.

@ibmua
Contributor

ibmua commented Jun 2, 2016

It's a good idea to test the influence of LSUV-init batch size on large networks with highly variant data. It seems from the paper that you only tested this for some tiny tanh CIFAR-10 net with relatively tiny batches, and even then a positive correlation between batch size and performance could be seen. It seems obvious that the influence is much greater for large networks and highly variable data with, say, thousands of classification outputs. For example, I'm currently training a net to classify 33.5k mouse-painted symbols. That CIFAR-10 data tells me nothing.

Testing 6 different batches on my data:
poolconv_2 variance = 0.0515101 mean = 0.206088
poolconv_2 variance = 0.0517261 mean = 0.206058
poolconv_2 variance = 0.0521883 mean = 0.206072
poolconv_2 variance = 0.989368 mean = 0.22629
poolconv_2 variance = 0.995676 mean = 0.226181
poolconv_2 variance = 0.998411 mean = 0.22653

One of the reasons why testing batch influence is so important is that if it really improves things, we can switch to computing the LSUV variance not over one mini-batch but over several iterations, and use, say, a 100k batch rather than 1000.
Also, people don't really use tanh much nowadays, so personally I don't get why you used that one for the batch-size benchmark. It tells us almost nothing, really.
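
For anyone who wants to reproduce such numbers: the statistics above can be read out with a few lines of pycaffe; a sketch, with the net path and blob name as placeholders:

# Sketch: activation mean/variance of one blob over several mini-batches.
import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # data layer supplies batches

for _ in range(6):
    net.forward()  # each call draws the next mini-batch
    act = net.blobs['poolconv_2'].data
    print('poolconv_2 variance = %.6f mean = %.6f' % (act.var(), act.mean()))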

@ducha-aiki
Owner Author

I used tanh because LSUV worked worst for it. My experience with ImageNet confirms that any batch size > 32 is OK for LSUV, if the data is shuffled. Your data looks imbalanced or not shuffled.
If you do such a test on your dataset, I will definitely add it to this benchmark :)

@ibmua
Contributor

ibmua commented Jun 2, 2016

Yes, it actually is not very shuffled, which means that I have to use a larger batch here to get something like what I would get with a smaller one if the data were shuffled.

The problem with benchmarking is that I don't have the required stuff set up on my computer, and it would take me several hours to do that.

@ibmua
Contributor

ibmua commented Jul 5, 2016

It would be great if you noted the time it took for the different things to converge. Both epochs and actual time.

@ducha-aiki
Owner Author

@ibmua for epochs it is very easy - 320K everywhere :) As for times, it is hard (I am lazy), because some of the trainings consist of lots of save-load cycles with pauses between them. But I am going to do this... once :)

@ibmua
Contributor

ibmua commented Jul 5, 2016

Yeah, I guess that would just take

import time
start = time.time()
# ... training runs here ...
end = time.time()
elapsed = end - start  # seconds

, logging the elapsed time to a separate file during saves and reading it back during loads. Not too hard. It would give a better sense of the difference between methods. For example, training with momentum is great, but it takes more time.

@ducha-aiki
Owner Author

@ibmua

import time
start = time.time()
end = time.time()

Do you really think I do it in Python? You are underestimating my laziness ;) Everything I do is editing *.prototxt and then running caffe train --solver=current_net.prototxt > logfile.log
Then I call the https://github.com/BVLC/caffe/blob/master/tools/extra/plot_training_log.py.example script with default keys to get the graphs :)
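
If wall-clock numbers were ever wanted without touching the training setup, the glog timestamps already written to logfile.log could be parsed after the fact; a sketch, assuming the default "I0705 12:34:56.789012 ..." log prefix and ignoring day wrap-around and pauses:

# Sketch: estimate elapsed training time from Caffe's glog timestamps.
import re

STAMP = re.compile(r'^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d+)')

def seconds(line):
    m = STAMP.match(line)
    if m is None:
        return None
    h, mn, s, us = m.groups()
    return int(h) * 3600 + int(mn) * 60 + int(s) + int(us) / 1e6

with open('logfile.log') as f:
    times = [t for t in map(seconds, f) if t is not None]
print('elapsed: %.0f s (same-day lines only)' % (times[-1] - times[0]))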

@ibmua
Contributor

ibmua commented Jul 5, 2016

Wow, okay. I guess that would just take editing plot_training_log.py.example then.

@ibmua
Contributor

ibmua commented Jul 5, 2016

Actually, easier: just measure a single forward-backward pass on the side and then multiply by the number of iterations from the log. Having info on whether things actually converged or you just cut them off because they would take >320k iterations would help too. Also, graphs would be a pleasant addition.
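
Caffe even ships a benchmarking mode for exactly this (caffe time -model train_val.prototxt -iterations 50), and the same measurement takes a few lines of pycaffe; a sketch, with the net path as a placeholder:

# Sketch: time one forward-backward pass and extrapolate to the full run.
import time

import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)
net.forward()
net.backward()  # warm-up pass, excluded from timing

runs = 50
start = time.time()
for _ in range(runs):
    net.forward()
    net.backward()
per_iter = (time.time() - start) / runs
print('estimated total: %.1f h' % (per_iter * 320000 / 3600.0))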

@ibmua
Contributor

ibmua commented Jul 29, 2016

It's interesting to find out how ResNets perform with different activations. PReLU was co-authored by Kaiming He, a co-author of ResNets, but for some reason he used plain ReLUs in ResNets.

@ducha-aiki
Owner Author

CReLU is on the way https://arxiv.org/pdf/1603.05201.pdf

@wangg12

wangg12 commented Jun 19, 2017

How about SELU, @ducha-aiki?

@ducha-aiki
Owner Author

@simbaforrest

@ducha-aiki In the SELU paper they seem to suggest MSRA-like initialization for the weights, while your prototxts seem to use fixed Gaussian initialization. Could this affect the result?

@ducha-aiki
Owner Author

@simbaforrest I don't use fixed initialization; I use LSUV init https://arxiv.org/abs/1511.06422, which is stated on each page of this repo :)
What is in the prototxt does not matter, because LSUV is run as a separate script. LSUV uses data to adjust the weight values, so if MSRA is optimal, LSUV will output something similar to MSRA.
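
The core of LSUV is just that data-driven loop; a rough sketch of the idea (not the actual script from the LSUV repo), with placeholder layer/blob names and orthonormal pre-initialization assumed:

# Sketch of the LSUV idea: rescale each layer's weights until its output
# blob has unit variance on real data (after orthonormal initialization).
import caffe

net = caffe.Net('train_val.prototxt', caffe.TRAIN)
for layer, blob in [('conv1', 'conv1'), ('conv2', 'conv2')]:  # placeholders
    for _ in range(10):  # a few correction passes per layer
        net.forward()
        var = net.blobs[blob].data.var()
        if abs(var - 1.0) < 0.01:
            break
        net.params[layer][0].data[...] /= var ** 0.5  # rescale weights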

@simbaforrest

simbaforrest commented Aug 21, 2017 via email

@ducha-aiki
Owner Author

Well, it is still possible that SELU is very architecture-dependent.
