Add hardcoded flint_mpn_aors_n for ARM and x86 #2118
These are generated from `dev/gen_ARCH_aors.jl`. Also add tests for it.
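For readers skimming the thread, here is a rough C sketch of what "hardcoded" per-length routines could look like: one fully unrolled function per length, selected through a table indexed by `n`, with a generic loop as fallback. This is only an illustration under stated assumptions. The names `sketch_add`, `sketch_add_tab`, `SKETCH_STEP` and `SKETCH_MAX_N` are hypothetical, the limb type is assumed to be 64 bits, and the routines in this PR are generated assembly rather than C.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */
typedef limb_t (*add_fn)(limb_t *, const limb_t *, const limb_t *);

/* One add-with-carry step on limb i (hypothetical helper for this sketch).
   Expects rp, ap, bp and the running carry cy to be in scope. */
#define SKETCH_STEP(i)                                  \
    do {                                                \
        limb_t a = ap[i], s = a + bp[i], t = s + cy;    \
        cy = (limb_t) ((s < a) | (t < s));              \
        rp[i] = t;                                      \
    } while (0)

/* Fixed-length routines: no loop counter and no length argument. */
static limb_t sketch_add_1(limb_t *rp, const limb_t *ap, const limb_t *bp)
{ limb_t cy = 0; SKETCH_STEP(0); return cy; }

static limb_t sketch_add_2(limb_t *rp, const limb_t *ap, const limb_t *bp)
{ limb_t cy = 0; SKETCH_STEP(0); SKETCH_STEP(1); return cy; }

static limb_t sketch_add_3(limb_t *rp, const limb_t *ap, const limb_t *bp)
{ limb_t cy = 0; SKETCH_STEP(0); SKETCH_STEP(1); SKETCH_STEP(2); return cy; }

/* The sketch stops at 3 limbs only to stay short. */
#define SKETCH_MAX_N 3

static const add_fn sketch_add_tab[SKETCH_MAX_N + 1] =
    { NULL, sketch_add_1, sketch_add_2, sketch_add_3 };

/* Generic fallback for lengths outside the table. */
static limb_t sketch_add_generic(limb_t *rp, const limb_t *ap,
                                 const limb_t *bp, size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++)
        SKETCH_STEP(i);
    return cy;
}

/* Dispatch: table lookup for small n, loop otherwise. Returns the carry-out. */
static limb_t sketch_add(limb_t *rp, const limb_t *ap,
                         const limb_t *bp, size_t n)
{
    if (n >= 1 && n <= SKETCH_MAX_N)
        return sketch_add_tab[n](rp, ap, bp);
    return sketch_add_generic(rp, ap, bp, n);
}
```

A caller that already knows `n` at compile time could skip the dispatch and call the fixed-length routine directly, which is where most of the benefit for small `n` would be expected to come from.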
Quick timings:
Looks worth extending a bit beyond 16.
By the way, we often call
Perhaps I have to fine-tune the thing first. It seems a little bit slow, but I may be wrong.
For ARM, no. For x86, it may be the case that we can use
It looks reasonable considering that the O(1) function call overhead makes up a good chunk of the time for small n. One can also do something like this (ignoring the carry-out for now):
We get something like this:
Then we gain a bit for small
BTW: I think I mentioned this before: add_sssss.... macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient NN_ADD_N macros specifically for in-memory operands with inline asm versions of your functions, with their explicit move/addressing instructions.
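As a point of reference for the in-memory case, the operation under discussion can be written as a plain C loop in which every limb is loaded, added with the incoming carry, and stored. This is a minimal portable sketch, not FLINT's implementation: `sketch_add_n` is a hypothetical name and 64-bit limbs are assumed. An inline-asm or intrinsic version would replace the two comparisons with a hardware add-with-carry chain and schedule the loads and stores explicitly.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */

/* Hypothetical helper, not part of FLINT:
   rp[0..n) = ap[0..n) + bp[0..n), returns the carry-out (0 or 1). */
static limb_t sketch_add_n(limb_t *rp, const limb_t *ap,
                           const limb_t *bp, size_t n)
{
    limb_t carry = 0;

    for (size_t i = 0; i < n; i++)
    {
        limb_t a = ap[i];        /* load before storing: rp may alias ap or bp */
        limb_t s = a + bp[i];    /* wraps modulo 2^64 */
        limb_t c1 = s < a;       /* carry out of a + b */
        limb_t t = s + carry;    /* add the incoming carry */
        limb_t c2 = t < s;       /* carry out of that second addition */

        rp[i] = t;
        carry = c1 | c2;         /* at most one of c1 and c2 can be set */
    }

    return carry;
}
```

The load, add, and store for each limb are independent of the loop bookkeeping, which is the part a compiler has trouble interleaving well when the same computation is expressed through register-operand macros.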
Yes, but I recall Clang being sort of good at this, at least with intrinsics.
Using
Keep in mind that the speedup for addition is not too significant in itself. Multiplication is a more important routine to optimize since it's heavy, so optimizing it can save more time in absolute terms.
Note to self: If we want n > 16, we need to shift
This will fail on ARM, but now
I believe using this So maybe there shouldn't be an
I haven't tested this on ARM machines yet. The x86 code should probably not reside within Broadwell since this is Nehalem compatible, but basically all x86 processors today are Broadwell compatible.