Add hardcoded flint_mpn_aors_n for ARM and x86 #2118

Open · wants to merge 13 commits into main

Conversation

albinahlback
Collaborator

These are generated from `dev/gen_ARCH_aors.jl`. Tests for them are also added. I haven't tested this on ARM machines yet.

The x86 code should probably not reside within Broadwell since it is Nehalem-compatible, but basically all x86 processors today are Broadwell-compatible.

@fredrik-johansson
Collaborator

Quick timings:

   n  mpn_add_n  flint_...   speedup
   1  2.300e-09  1.810e-09   1.271
   2  2.510e-09  1.810e-09   1.387
   3  2.960e-09  2.040e-09   1.451
   4  2.760e-09  2.050e-09   1.346
   5  3.010e-09  2.080e-09   1.447
   6  3.280e-09  2.120e-09   1.547
   7  3.710e-09  2.350e-09   1.579
   8  3.330e-09  2.180e-09   1.528
   9  3.600e-09  2.430e-09   1.481
  10  3.840e-09  2.630e-09   1.460
  11  4.300e-09  2.850e-09   1.509
  12  3.860e-09  3.060e-09   1.261
  13  4.130e-09  3.360e-09   1.229
  14  4.600e-09  3.590e-09   1.281
  15  5.050e-09  3.780e-09   1.336
  16  4.890e-09  4.020e-09   1.216
  17  5.320e-09  5.530e-09   0.962
  18  5.690e-09  5.500e-09   1.035
  19  6.210e-09  6.320e-09   0.983
  20  5.320e-09  5.270e-09   1.009
  21  5.540e-09  5.570e-09   0.995
  22  5.840e-09  5.900e-09   0.990
  23  6.120e-09  6.140e-09   0.997
  24  6.320e-09  6.340e-09   0.997
  25  6.520e-09  6.560e-09   0.994
  26  6.850e-09  6.830e-09   1.003
  27  6.990e-09  6.960e-09   1.004
  28  7.250e-09  7.270e-09   0.997
  29  7.550e-09  7.570e-09   0.997
  30  7.700e-09  7.690e-09   1.001
  31  7.940e-09  7.980e-09   0.995
  32  8.180e-09  8.160e-09   1.002

Looks worth extending a bit beyond 16.

@fredrik-johansson
Collaborator

By the way, we often call mpn_aors_n(rp, xp, yp, n) with the same operand for rp and xp. Would functions specialized for this case gain anything?

@albinahlback
Collaborator Author

> Quick timings:
>
> ...
>
> Looks worth extending a bit beyond 16.

Perhaps I have to fine-tune the thing first. It seems a little slow, but I may be wrong.

@albinahlback
Collaborator Author

> By the way, we often call mpn_aors_n(rp, xp, yp, n) with the same operand for rp and xp. Would functions specialized for this case gain anything?

For ARM, no. For x86, we may be able to use OPC reg, mem, i.e. act directly on memory. At least this would remove one store instruction, so it's worth a try.
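
A hedged sketch of what that could look like (assuming x86-64 with GCC/Clang extended asm; the function name is illustrative and the carry-out is ignored):

static inline void add_2_inplace(unsigned long * rp, const unsigned long * yp)
{
    /* rp serves as both input and output, i.e. the rp == xp case */
    __asm__ volatile(
        "movq (%[y]), %%rax\n\t"
        "addq %%rax, (%[r])\n\t"   /* OPC reg, mem: add straight into memory */
        "movq 8(%[y]), %%rax\n\t"
        "adcq %%rax, 8(%[r])"      /* propagate the carry into the next limb */
        :
        : [r] "r" (rp), [y] "r" (yp)
        : "rax", "cc", "memory");
}

The read-modify-write addq/adcq forms are what make the separate result stores disappear.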

@fredrik-johansson
Collaborator

> > Quick timings:
> > ...
> > Looks worth extending a bit beyond 16.
>
> Perhaps I have to fine-tune the thing first. It seems a little slow, but I may be wrong.

It looks reasonable considering that the O(1) function call overhead makes up a good chunk of the time for small n.

One can also do something like this (ignoring the carry-out for now):

if (n <= 8)
{
    switch (n)
    {
        case 0: break;
        case 1: z[0] = x[0] + y[0]; break;
        case 2: NN_ADD_2(z, x, y); break;
        case 3: NN_ADD_3(z, x, y); break;
        case 4: NN_ADD_4(z, x, y); break;
        case 5: NN_ADD_5(z, x, y); break;
        case 6: NN_ADD_6(z, x, y); break;
        case 7: NN_ADD_7(z, x, y); break;
        case 8: NN_ADD_8(z, x, y); break;
    }
}
else
{
    flint_mpn_add_n(z, x, y, n);
}

We get something like this:

   n  mpn_add_n  flint_...  inlined     speedup  flint_.../inlined
   1  2.270e-09  1.830e-09  9.680e-10   1.240   1.890
   2  2.510e-09  1.830e-09  8.100e-10   1.372   2.259
   3  2.970e-09  2.050e-09  9.730e-10   1.449   2.107
   4  2.760e-09  2.080e-09  1.210e-09   1.327   1.719
   5  2.990e-09  2.100e-09  1.380e-09   1.424   1.522
   6  3.230e-09  2.140e-09  1.640e-09   1.509   1.305
   7  3.670e-09  2.360e-09  1.890e-09   1.555   1.249
   8  3.270e-09  2.330e-09  2.070e-09   1.403   1.126
   9  3.520e-09  2.500e-09  2.650e-09   1.408   0.943
  10  3.750e-09  2.660e-09  2.880e-09   1.410   0.924
  11  4.190e-09  2.940e-09  3.150e-09   1.425   0.933
  12  3.810e-09  3.170e-09  3.350e-09   1.202   0.946
  13  4.060e-09  3.380e-09  3.560e-09   1.201   0.949
  14  4.300e-09  3.620e-09  3.790e-09   1.188   0.955
  15  4.750e-09  3.860e-09  4.040e-09   1.231   0.955
  16  4.840e-09  4.060e-09  4.240e-09   1.192   0.958
  17  5.410e-09  5.150e-09  5.980e-09   1.050   0.861
  18  5.630e-09  5.680e-09  6.370e-09   0.991   0.892
  19  6.350e-09  6.320e-09  7.040e-09   1.005   0.898
  20  5.220e-09  5.180e-09  5.690e-09   1.008   0.910
  21  5.520e-09  5.560e-09  5.640e-09   0.993   0.986

Then we gain a bit for small n but lose a bit for larger n because of the extra logic.

@fredrik-johansson
Collaborator

fredrik-johansson commented Nov 28, 2024

BTW: I think I mentioned this before: the add_sssss... macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient NN_ADD_N macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.
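
A minimal sketch of such a macro (assuming GCC/Clang extended asm on x86-64, and reusing the NN_ADD_2 name from the snippet above purely for illustration; carry-out ignored):

#define NN_ADD_2(z, x, y)                                   \
    do {                                                    \
        unsigned long __t0 = (x)[0], __t1 = (x)[1];         \
        __asm__("addq %[y0], %[t0]\n\t"                     \
                "adcq %[y1], %[t1]"                         \
                : [t0] "+r" (__t0), [t1] "+r" (__t1)        \
                : [y0] "rm" ((y)[0]), [y1] "rm" ((y)[1])    \
                : "cc");                                    \
        (z)[0] = __t0;                                      \
        (z)[1] = __t1;                                      \
    } while (0)

The "rm" constraints let the compiler feed the y limbs directly from memory, so the add/adc chain gets explicit memory operands instead of compiler-scheduled moves.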

@albinahlback
Collaborator Author

> BTW: I think I mentioned this before: the add_sssss... macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient NN_ADD_N macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.

Yes, but I recall Clang being fairly good at this, at least with intrinsics.
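
For reference, the intrinsics route would look something like this (a sketch assuming x86-64 and _addcarry_u64 from <immintrin.h>, which both GCC and Clang provide; the function is illustrative, not from this patch):

#include <immintrin.h>

static inline unsigned char
add_4_intrin(unsigned long long * zp, const unsigned long long * xp,
             const unsigned long long * yp)
{
    unsigned char cy = 0;
    cy = _addcarry_u64(cy, xp[0], yp[0], &zp[0]);  /* add, sets carry */
    cy = _addcarry_u64(cy, xp[1], yp[1], &zp[1]);  /* adc chain */
    cy = _addcarry_u64(cy, xp[2], yp[2], &zp[2]);
    cy = _addcarry_u64(cy, xp[3], yp[3], &zp[3]);
    return cy;                                     /* carry-out */
}

Clang in particular tends to compile such chains into a clean add/adc sequence with the loads interleaved.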

@fredrik-johansson
Collaborator

Using flint_mpn_aors_n instead of mpn_aors_n in mpn_mod_mat_mul seems to give a slight slowdown. However, there seems to be a very slight (~1%) speedup from specializing the inner loops to call flint_mpn_aors_N directly for various N. I guess one needs to find more addition-heavy workloads for this to really pay off.

@albinahlback
Collaborator Author

Keep in mind that the speedup for addition is not too significant in itself. Multiplication is a more important routine to optimize since it is heavier; optimizing it can save more time in absolute terms.

@albinahlback
Collaborator Author

Note to self: if we want n > 16, we need to shift ap, bp and rp so that we do not get larger instructions. With byte offsets larger than 127 (limb 16 already sits at offset 16*8 = 128), x86 needs a 32-bit rather than an 8-bit displacement, so the instruction size increases, which, I think, would negatively impact the performance of these functions (at least it would increase the risk of instruction cache misses when using them).
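
A hedged illustration of the shift (x86-64, GCC/Clang extended asm; the function is illustrative only):

static inline void limb_offsets_sketch(const unsigned long * ap)
{
    unsigned long t0, t1;
    __asm__ volatile(
        "movq 120(%[a]), %[t0]\n\t"  /* limb 15, offset 120: fits disp8 */
        "leaq 128(%[a]), %[a]\n\t"   /* one-time shift of the base pointer */
        "movq (%[a]), %[t1]"         /* limb 16, offset 0: back in disp8 range */
        : [a] "+r" (ap), [t0] "=r" (t0), [t1] "=r" (t1)
        :
        : "memory");
    (void) t0;
    (void) t1;
}

Without the leaq, every access past limb 15 would carry a 4-byte displacement, i.e. 3 extra encoding bytes per instruction.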

@albinahlback
Collaborator Author

This will fail on ARM, but now aorsrsh is implemented for x86.

@fredrik-johansson
Collaborator

I believe using this flint_mpn_add_n in actual code slows things down because the penalty of a mispredicted indirect jump (and maybe a predicted one too?) outweighs the speedup of the addition itself.

So maybe there shouldn't be a flint_mpn_add_n, but there could still be flint_mpn_add_12 for use in routines with hardcoded lengths; we could define such fixed-length functions on all architectures by having a macro for mpn_add_n(..., 12) as the fallback (we should actually do this for multiplication routines too).
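
That fallback could be as simple as the following sketch (the FLINT_HAVE_ASSEMBLY_add_12 guard is hypothetical; mpn_add_n is GMP's generic n-limb addition):

#include <gmp.h>

#ifdef FLINT_HAVE_ASSEMBLY_add_12  /* hypothetical configure-time flag */
/* hardcoded assembly version, as added in this PR for x86/ARM */
mp_limb_t flint_mpn_add_12(mp_ptr rp, mp_srcptr xp, mp_srcptr yp);
#else
/* portable fallback: let GMP's generic loop handle the fixed length */
#define flint_mpn_add_12(rp, xp, yp) mpn_add_n((rp), (xp), (yp), 12)
#endif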
