Add hardcoded flint_mpn_aors_n for ARM and x86 #2118

Open · wants to merge 13 commits into main

Conversation

albinahlback
Collaborator

These are generated from `dev/gen_ARCH_aors.jl`. Tests for them are also added. I haven't tested this on ARM machines yet.

The x86 code should probably not reside within Broadwell since it is Nehalem-compatible, but basically all x86 processors today are Broadwell-compatible.

@fredrik-johansson
Collaborator

Quick timings:

   n  mpn_add_n  flint_...   speedup
   1  2.300e-09  1.810e-09   1.271
   2  2.510e-09  1.810e-09   1.387
   3  2.960e-09  2.040e-09   1.451
   4  2.760e-09  2.050e-09   1.346
   5  3.010e-09  2.080e-09   1.447
   6  3.280e-09  2.120e-09   1.547
   7  3.710e-09  2.350e-09   1.579
   8  3.330e-09  2.180e-09   1.528
   9  3.600e-09  2.430e-09   1.481
  10  3.840e-09  2.630e-09   1.460
  11  4.300e-09  2.850e-09   1.509
  12  3.860e-09  3.060e-09   1.261
  13  4.130e-09  3.360e-09   1.229
  14  4.600e-09  3.590e-09   1.281
  15  5.050e-09  3.780e-09   1.336
  16  4.890e-09  4.020e-09   1.216
  17  5.320e-09  5.530e-09   0.962
  18  5.690e-09  5.500e-09   1.035
  19  6.210e-09  6.320e-09   0.983
  20  5.320e-09  5.270e-09   1.009
  21  5.540e-09  5.570e-09   0.995
  22  5.840e-09  5.900e-09   0.990
  23  6.120e-09  6.140e-09   0.997
  24  6.320e-09  6.340e-09   0.997
  25  6.520e-09  6.560e-09   0.994
  26  6.850e-09  6.830e-09   1.003
  27  6.990e-09  6.960e-09   1.004
  28  7.250e-09  7.270e-09   0.997
  29  7.550e-09  7.570e-09   0.997
  30  7.700e-09  7.690e-09   1.001
  31  7.940e-09  7.980e-09   0.995
  32  8.180e-09  8.160e-09   1.002

Looks worth extending a bit beyond 16.

@fredrik-johansson
Collaborator

By the way, we often call mpn_aors_n(rp, xp, yp, n) with the same operand for rp and xp. Would functions specialized for this case gain anything?

@albinahlback
Collaborator Author

> Quick timings:
>
> ...
>
> Looks worth extending a bit beyond 16.

Perhaps I have to fine-tune the thing first. It seems a little slow, but I may be wrong.

@albinahlback
Collaborator Author

> By the way, we often call mpn_aors_n(rp, xp, yp, n) with the same operand for rp and xp. Would functions specialized for this case gain anything?

For ARM, no. For x86, we may be able to use OPC reg, mem, i.e. act directly on memory. At least this would remove one store instruction, so it's worth a try.
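
A hedged sketch of what that could look like (assuming x86-64 with GCC/Clang extended asm; the function name is illustrative and the carry-out is ignored):

static inline void add_2_inplace(unsigned long * rp, const unsigned long * yp)
{
    /* rp serves as both input and output, i.e. the rp == xp case */
    __asm__ volatile(
        "movq (%[y]), %%rax\n\t"
        "addq %%rax, (%[r])\n\t"   /* OPC reg, mem: add straight into memory */
        "movq 8(%[y]), %%rax\n\t"
        "adcq %%rax, 8(%[r])"      /* propagate the carry into the next limb */
        :
        : [r] "r" (rp), [y] "r" (yp)
        : "rax", "cc", "memory");
}

The read-modify-write addq/adcq forms are what make the separate result stores disappear.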

@fredrik-johansson
Collaborator

> > Quick timings:
> > ...
> > Looks worth extending a bit beyond 16.
>
> Perhaps I have to fine-tune the thing first. It seems a little slow, but I may be wrong.

It looks reasonable considering that the O(1) function call overhead makes up a good chunk of the time for small n.

One can also do something like this (ignoring the carry-out for now):

if (n <= 8)
{
    switch (n)
    {
        case 0: break;
        case 1: z[0] = x[0] + y[0]; break;
        case 2: NN_ADD_2(z, x, y); break;
        case 3: NN_ADD_3(z, x, y); break;
        case 4: NN_ADD_4(z, x, y); break;
        case 5: NN_ADD_5(z, x, y); break;
        case 6: NN_ADD_6(z, x, y); break;
        case 7: NN_ADD_7(z, x, y); break;
        case 8: NN_ADD_8(z, x, y); break;
    }
}
else
{
    flint_mpn_add_n(z, x, y, n);
}

We get something like this:

   n  mpn_add_n  flint_...  inlined     speedup  flint_.../inlined
   1  2.270e-09  1.830e-09  9.680e-10   1.240   1.890
   2  2.510e-09  1.830e-09  8.100e-10   1.372   2.259
   3  2.970e-09  2.050e-09  9.730e-10   1.449   2.107
   4  2.760e-09  2.080e-09  1.210e-09   1.327   1.719
   5  2.990e-09  2.100e-09  1.380e-09   1.424   1.522
   6  3.230e-09  2.140e-09  1.640e-09   1.509   1.305
   7  3.670e-09  2.360e-09  1.890e-09   1.555   1.249
   8  3.270e-09  2.330e-09  2.070e-09   1.403   1.126
   9  3.520e-09  2.500e-09  2.650e-09   1.408   0.943
  10  3.750e-09  2.660e-09  2.880e-09   1.410   0.924
  11  4.190e-09  2.940e-09  3.150e-09   1.425   0.933
  12  3.810e-09  3.170e-09  3.350e-09   1.202   0.946
  13  4.060e-09  3.380e-09  3.560e-09   1.201   0.949
  14  4.300e-09  3.620e-09  3.790e-09   1.188   0.955
  15  4.750e-09  3.860e-09  4.040e-09   1.231   0.955
  16  4.840e-09  4.060e-09  4.240e-09   1.192   0.958
  17  5.410e-09  5.150e-09  5.980e-09   1.050   0.861
  18  5.630e-09  5.680e-09  6.370e-09   0.991   0.892
  19  6.350e-09  6.320e-09  7.040e-09   1.005   0.898
  20  5.220e-09  5.180e-09  5.690e-09   1.008   0.910
  21  5.520e-09  5.560e-09  5.640e-09   0.993   0.986

Then we gain a bit for small n but lose a bit for larger n because of the extra logic.

@fredrik-johansson
Collaborator

fredrik-johansson commented Nov 28, 2024

BTW: I think I mentioned this before: the add_sssss... macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient NN_ADD_N macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.
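
A minimal sketch of such a macro (assuming GCC/Clang extended asm on x86-64, and reusing the NN_ADD_2 name from the snippet above purely for illustration; carry-out ignored):

#define NN_ADD_2(z, x, y)                                   \
    do {                                                    \
        unsigned long __t0 = (x)[0], __t1 = (x)[1];         \
        __asm__("addq %[y0], %[t0]\n\t"                     \
                "adcq %[y1], %[t1]"                         \
                : [t0] "+r" (__t0), [t1] "+r" (__t1)        \
                : [y0] "rm" ((y)[0]), [y1] "rm" ((y)[1])    \
                : "cc");                                    \
        (z)[0] = __t0;                                      \
        (z)[1] = __t1;                                      \
    } while (0)

The "rm" constraints let the compiler feed the y limbs directly from memory, so the add/adc chain gets explicit memory operands instead of compiler-scheduled moves.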

@albinahlback
Collaborator Author

> BTW: I think I mentioned this before: the add_sssss... macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient NN_ADD_N macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.

Yes, but I recall Clang being fairly good at this, at least with intrinsics.
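
For reference, the intrinsics route would look something like this (a sketch assuming x86-64 and _addcarry_u64 from <immintrin.h>, which both GCC and Clang provide; the function is illustrative, not from this patch):

#include <immintrin.h>

static inline unsigned char
add_4_intrin(unsigned long long * zp, const unsigned long long * xp,
             const unsigned long long * yp)
{
    unsigned char cy = 0;
    cy = _addcarry_u64(cy, xp[0], yp[0], &zp[0]);  /* add, sets carry */
    cy = _addcarry_u64(cy, xp[1], yp[1], &zp[1]);  /* adc chain */
    cy = _addcarry_u64(cy, xp[2], yp[2], &zp[2]);
    cy = _addcarry_u64(cy, xp[3], yp[3], &zp[3]);
    return cy;                                     /* carry-out */
}

Clang in particular tends to compile such chains into a clean add/adc sequence with the loads interleaved.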

@fredrik-johansson
Collaborator

Using flint_mpn_aors_n instead of mpn_aors_n in mpn_mod_mat_mul seems to give a slight slowdown. However, there seems to be a very slight (~1%) speedup from specializing the inner loops to call flint_mpn_aors_N directly for various N. I guess one needs to find more addition-heavy workloads for this to really pay off.

@albinahlback
Collaborator Author

Keep in mind that the speedup for addition is not too significant in itself. Multiplication is a more important routine to optimize since it is heavier; optimizing it can save more time in absolute terms.

@albinahlback
Collaborator Author

Note to self: if we want n > 16, we need to shift ap, bp and rp so that we do not get larger instructions. With byte offsets larger than 127 (limb 16 already sits at offset 16*8 = 128), x86 needs a 32-bit rather than an 8-bit displacement, so the instruction size increases, which, I think, would negatively impact the performance of these functions (at least it would increase the risk of instruction cache misses when using them).
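
A hedged illustration of the shift (x86-64, GCC/Clang extended asm; the function is illustrative only):

static inline void limb_offsets_sketch(const unsigned long * ap)
{
    unsigned long t0, t1;
    __asm__ volatile(
        "movq 120(%[a]), %[t0]\n\t"  /* limb 15, offset 120: fits disp8 */
        "leaq 128(%[a]), %[a]\n\t"   /* one-time shift of the base pointer */
        "movq (%[a]), %[t1]"         /* limb 16, offset 0: back in disp8 range */
        : [a] "+r" (ap), [t0] "=r" (t0), [t1] "=r" (t1)
        :
        : "memory");
    (void) t0;
    (void) t1;
}

Without the leaq, every access past limb 15 would carry a 4-byte displacement, i.e. 3 extra encoding bytes per instruction.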

@albinahlback
Collaborator Author

This will fail on ARM, but now aorsrsh is implemented for x86.

@fredrik-johansson
Collaborator

I believe using this flint_mpn_add_n in actual code slows things down because the penalty of a mispredicted indirect jump (and maybe a predicted one too?) outweighs the speedup of the addition itself.

So maybe there shouldn't be a flint_mpn_add_n, but there could still be flint_mpn_add_12 for use in routines with hardcoded lengths; we could define such fixed-length functions on all architectures by having a macro for mpn_add_n(..., 12) as the fallback (we should actually do this for multiplication routines too).
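
That fallback could be as simple as the following sketch (the FLINT_HAVE_ASSEMBLY_add_12 guard is hypothetical; mpn_add_n is GMP's generic n-limb addition):

#include <gmp.h>

#ifdef FLINT_HAVE_ASSEMBLY_add_12  /* hypothetical configure-time flag */
/* hardcoded assembly version, as added in this PR for x86/ARM */
mp_limb_t flint_mpn_add_12(mp_ptr rp, mp_srcptr xp, mp_srcptr yp);
#else
/* portable fallback: let GMP's generic loop handle the fixed length */
#define flint_mpn_add_12(rp, xp, yp) mpn_add_n((rp), (xp), (yp), 12)
#endif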
