What I will test next #1
Is this worth testing? "Deep Learning with S-shaped Rectified Linear Activation Units" |
Well, I don't like activations which cannot be done "in-place", but if SReLU gets implemented in Caffe, I will test it. |
I have learned a lot from this caffenet-benchmark; is there anything I can try to speed up the benchmarks? |
@Darwin2011 |
Take a look at https://github.com/KaimingHe/deep-residual-networks/ |
@bhack thanks, as usual ;) |
@bhack yes, I have contacted that guy :) Now testing ResNet101 (as in paper, + without scale/bias, + without BN) without last block, because activation size is too small for it. |
@ducha-aiki, did you have any luck with training ResNet on Caffe? Thanks |
@1adrianb there is even one successful attempt here https://github.com/ducha-aiki/caffenet-benchmark/blob/master/ResNets.md |
@ducha-aiki thanks a lot for your prompt answer and for the recommended paper. |
@1adrianb |
@ducha-aiki makes sense :) Thanks again for your help! |
Now VGG16 with all the tricks is in training (to check whether they help for weaker models only). |
Hi, have you tried pre-activation resnet described in paper http://arxiv.org/abs/1603.05027? And what about Inception-ResNet-v2 described in paper http://arxiv.org/abs/1602.07261? |
@wangxianliang |
It would be a good idea to test the influence of LSUV-init batch size on large networks with highly variant data. In the paper you seem to have only tested this on a tiny tanh CIFAR-10 net with relatively tiny batches, and even then a positive correlation between batch size and performance could be seen. It seems obvious that the influence is much greater for large networks and highly variable data with, say, thousands of classification outputs. For example, I'm currently training a net for classifying 33.5k mouse-painted symbols; that CIFAR-10 data tells me nothing. Testing 6 different batches on my data: One of the reasons why testing batch influence is so important is that if it really improves things, we can switch to computing the LSUV variance not via mini-batches but via iterations, and use, say, a 100k batch rather than 1000. |
I used tanh because LSUV worked the worst for it. My experience with ImageNet confirms that any batch size > 32 is OK for LSUV if the data is shuffled. Your data looks imbalanced or not shuffled. |
Yes, it actually is not very shuffled, which means I have to use a larger batch here to get something closer to what I would get with a smaller one if it were shuffled. The problem with benchmarking is that I don't have the required stuff set up on my computer, and it would take me several hours to do that. |
It would be great if you noted the time it took for the different setups to converge, both in epochs and in actual wall-clock time. |
@ibmua for epochs it is very easy - 320K everywhere :) As for times - it is hard (I am lazy) because some of the trainings consist of lots of save-load with pauses between them. But I am going to do this...once :) |
Yeah, I guess that would just take import time, logging the spent time to a separate file during saves and reading it back during loads. Not too hard. It would give a better sense of the difference between methods. For example, training with momentum is great, but it takes more time. |
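A minimal sketch of that bookkeeping (the side-file name and the hook points are made up here, not taken from the repo): accumulate wall-clock time in a separate file, reading it back on every resume.

```python
import os
import time

TIME_LOG = 'elapsed_seconds.txt'  # hypothetical side file kept next to the training log

# on (re)start: read the time accumulated by previous runs, if any
elapsed_before = 0.0
if os.path.exists(TIME_LOG):
    with open(TIME_LOG) as f:
        elapsed_before = float(f.read().strip() or 0)

start = time.time()
# ... launch or resume training here, e.g. run `caffe train` as a subprocess ...

# on save/exit: write the new running total back to the side file
with open(TIME_LOG, 'w') as f:
    f.write(str(elapsed_before + (time.time() - start)))
```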
Do you really think I do it in Python? You are underestimating my laziness ;) Everything I do is editing *.prototxt and then running caffe train --solver=current_net.prototxt > logfile.log |
Wow, okay. Guess that would take editing plot_training_log.py.example then. |
Actually, easier: just measure a single forward-backward pass on the side and then multiply by the number of iterations from the log. Having info on whether the training converged or you just cut it off because it would take >320k iterations would help too. Also, graphs would be a pleasant addition. |
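A rough sketch of that estimate in pycaffe (assuming pycaffe is available and a train_val.prototxt sits in the working directory; the 320k figure is the iteration count mentioned earlier in the thread): time one forward-backward pass, then multiply by the iteration count from the log.

```python
import time
import caffe

caffe.set_mode_gpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # hypothetical net definition

# warm-up pass so memory allocation / autotuning is not counted
net.forward()
net.backward()

n_trials = 50
start = time.time()
for _ in range(n_trials):
    net.forward()
    net.backward()
per_iter = (time.time() - start) / n_trials

total_iters = 320000  # from the training schedule mentioned above
print('~%.1f hours for %d iterations' % (per_iter * total_iters / 3600.0, total_iters))
```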
It's interesting to find out how ResNets perform for different activations. PReLU was coauthored by Kaiming He - coauthor of ResNets, but for some reason he used ReLUs in ResNets. |
CReLU is on the way https://arxiv.org/pdf/1603.05201.pdf |
How about SELU @ducha-aiki ? |
@wangg12 It is already there https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Activations.md |
@ducha-aiki In the SELU paper they seem to suggest using MSRA-like initialization for the weights, while your prototxt seems to use fixed Gaussian initialization? Could this affect the result? |
@simbaforrest I don't use fixed initialization, I use LSUV init https://arxiv.org/abs/1511.06422, which is stated on each page of this repo :) What is in the prototxt does not matter, because LSUV is run as a separate script. LSUV uses data to adjust the weight values, so if MSRA is optimal, then LSUV will output something similar to MSRA. |
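A minimal sketch of that idea (layer/blob names, the prototxt path, and the tolerance are illustrative, and the orthonormal pre-initialization step from the paper is omitted): after the usual weight fill, each layer's weights are rescaled so that its output variance on a real mini-batch is close to 1.

```python
import numpy as np
import caffe

def lsuv_scale_layer(net, layer_name, blob_name, tol=0.1, max_iters=10):
    """Rescale one layer's weights so its output variance on real data is ~1."""
    for _ in range(max_iters):
        net.forward()  # pulls a real mini-batch through the net's data layer
        var = net.blobs[blob_name].data.var()
        if abs(var - 1.0) < tol:
            break
        # the scale comes from data statistics, so the result ends up close to
        # whatever analytic init (e.g. MSRA) suits this layer
        net.params[layer_name][0].data[...] /= np.sqrt(var)

net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # hypothetical net definition
for name in ['conv1', 'conv2']:  # illustrative layer/blob names
    lsuv_scale_layer(net, name, name)
```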
Got it, so I guess SELU might only be good for FNNs rather than CNNs? |
Well, it is still possible that SELU is very architecture-dependent. |
- Pooling: AVG-pooling caffenet; "Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree" http://arxiv.org/abs/1509.08985 -- all three from the paper, thanks to the authors for the code.
- Weight decay values, L1/L2 weight decay, dropout rates.
- Freeze the conv structure and play with the fc6-fc8 classifier. Maxout? More layers? Convolution? Inspired by http://arxiv.org/abs/1504.06066, but in end-to-end style.
- Solvers: default caffenet + ADAM/RMSProp/Nesterov/"poly" policy.
- BatchNorm for blocks of layers, not each layer.
- For fully convolutional nets, what is better: avg pool on features, then a classifier, or the other way round?
- SqueezeNet https://github.com/DeepScale/SqueezeNet

Suggestions and training logs from the community are welcome.