Gradient descent in C? #35
Why did you choose to write the gradient descent code in C, rather than using the library you used for the other matrix computations? Would you get a speedup by doing the descent in hblas?

Comments
In a word: fusion; or rather, the lack of it. I had a version using hmatrix, but profiling showed it was taking up a large proportion of the runtime. I believe it was because it couldn't unroll the loops and work on one value at a time. The C rewrite was a good deal faster, and I have a benchmark on it in the suite (though I can't remember the speed-up right now). HBLAS might do it better, but again it's mostly a fusion issue. One might also do better by trying to aggressively use SIMD.
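To make the fusion point concrete, here is a rough Haskell sketch (not grenade's actual code) of the same descent step written two ways: the hmatrix version allocates an intermediate vector for each operation, while the single-pass version works one value at a time, which is roughly the shape of work the hand-written C loop (and, potentially, SIMD) gets to do.

```haskell
-- Hedged sketch, not grenade's code: the same SGD update, two ways.
import qualified Data.Vector.Storable as VS
import Numeric.LinearAlgebra (Vector, scale)

-- hmatrix style: `scale rate g` builds one intermediate vector, and the
-- subtraction builds another; the two calls can't fuse into one loop.
stepHMatrix :: Double -> Vector Double -> Vector Double -> Vector Double
stepHMatrix rate w g = w - scale rate g

-- Fused style: a single pass over both vectors, one output allocation,
-- working on one element at a time, which is essentially what a C loop does.
stepFused :: Double -> Vector Double -> Vector Double -> Vector Double
stepFused rate = VS.zipWith (\wi gi -> wi - rate * gi)
```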
Ah, ok. I'm about this close (holds fingers close together) to trying to make an accelerate backend/branch/fork (but I'm not sure how much work that would take) to get fusion/GPU/SIMD for "free". Is that something you'd be interested in, if I could make it work? (Unrelatedly, I've also got some outstanding changes to make various things instances of …)
I would be interested (especially if there are benchmarks). In grenade, for most networks, most of the run time is matrix-matrix multiplications, which is pretty much what you want. I know CUDA/cuDNN would be faster, but I'm not sure how well accelerate does the tasks we need.

If you're using LSTMs, probably the one thing which would get the biggest easy improvement would be proper minibatching. Matrix-matrix multiplications with BLAS are far more efficient than the equivalent run of matrix-vector multiplications.

As for …, I added the …

Thanks for the issue :)
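For the minibatching point, a rough hmatrix sketch (illustrative names, not grenade's API): running a layer once per example means many small matrix-vector products, while packing the examples as columns turns the whole batch into a single matrix-matrix product, which is where BLAS does best.

```haskell
-- Hedged sketch of minibatching with hmatrix; names are illustrative,
-- not grenade's API.
import Prelude hiding ((<>))
import Numeric.LinearAlgebra

-- One example at a time: n separate matrix-vector (GEMV) calls.
forwardEach :: Matrix Double -> [Vector Double] -> [Vector Double]
forwardEach w = map (w #>)

-- Minibatched: the examples become the columns of one matrix, and the
-- whole batch is a single matrix-matrix (GEMM) call.
forwardBatch :: Matrix Double -> [Vector Double] -> Matrix Double
forwardBatch w xs = w <> fromColumns xs
```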
So, I've started poking at an accelerate backend…
I'm at ICML at the moment, and have spoken with a few people who are interested in helping out with this effort. I might also talk to Trevor (who wrote accelerate) at the next meetup to see if he has any advice.
Neat. I'm happy to put what I have so far up on a branch... It's a bit fragmented so far, but as my first stab, I'm trying to replicate im2col in order to test out the benchmarks. My main dev laptop isn't CUDA-friendly, so I won't be able to test the upper limits. Also, I suspect a bunch of the improvement will come once you're actually stacking multiple layers together and the fusion starts kicking in. In the project that's motivating all of this work, I've noticed that the garbage collector is quite active in general.
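As a feel for why stacking layers is where fusion should pay off, here is a minimal accelerate sketch (hypothetical layer code, not the actual branch): the zipWith and both maps below are fused by the backend into a single traversal, so no intermediate arrays are allocated or garbage-collected between the "layers".

```haskell
-- Minimal, hypothetical sketch (not the actual branch): two stacked "layers"
-- written as accelerate combinators. The zipWith and both maps fuse into a
-- single kernel, so no intermediate arrays are built between stages.
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate.Interpreter (run)  -- stand-in backend for illustration

relu :: A.Exp Double -> A.Exp Double
relu x = A.max 0 x

twoLayers :: A.Acc (A.Vector Double)
          -> A.Acc (A.Vector Double)
          -> A.Acc (A.Vector Double)
twoLayers bias = A.map tanh . A.map relu . A.zipWith (+) bias

main :: IO ()
main = do
  let bias, xs :: A.Vector Double
      bias = A.fromList (A.Z A.:. 4) [0.1, 0.2, 0.3, 0.4]
      xs   = A.fromList (A.Z A.:. 4) [1, -2, 3, -4]
  print (run (twoLayers (A.use bias) (A.use xs)))
```

The interpreter backend here is only for illustration; the same program should run unchanged on a GPU or LLVM-native backend, which is where the "free" SIMD/GPU gains would come from.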
If there's anything I can do to help with an accelerate back end, let me know. I was about to take a look myself.
I chatted with Trevor today, and he is also interested in getting this working. |
Just noticed this, figured it should be linked from here since it seems relevant: #38 |