v3.20.3

JayDDee · Oct 22, 2022 · bd84f19
1 parent 58030e2
commit bd84f19
Showing 35 changed files with 1,005 additions and 4,960 deletions.
diff --git a/Makefile.am b/Makefile.am
@@ -205,7 +205,6 @@ cpuminer_SOURCES = \
   algo/verthash/tiny_sha3/sha3.c \
   algo/verthash/tiny_sha3/sha3-4way.c \
   algo/whirlpool/sph_whirlpool.c \
-  algo/whirlpool/whirlpool-hash-4way.c \
   algo/whirlpool/whirlpool-gate.c \
   algo/whirlpool/whirlpool.c \
   algo/whirlpool/whirlpoolx.c \

diff --git a/README.md b/README.md
@@ -40,17 +40,25 @@ Requirements
 Intel Core2 and newer and AMD equivalents. Further optimizations are available
 on some algoritms for CPUs with AES, AVX, AVX2, SHA, AVX512 and VAES.
 
-Older CPUs are supported by cpuminer-multi by TPruvot but at reduced
-performance.
+32 bit CPUs are not supported.
+Other CPU architectures such as ARM, Raspberry Pi, RISC-V, Xeon Phi, etc,
+are not supported.
 
-ARM and Aarch64 CPUs are not supported.
+Mobile CPUs like laptop computers are not recommended because they aren't
+designed for extreme heat of operating at full load for extended periods of
+time.
+
+Older CPUs and ARM architecture may be supported by cpuminer-multi by TPruvot.
 
 2. 64 bit Linux or Windows OS. Ubuntu and Fedora based distributions,
 including Mint and Centos, are known to work and have all dependencies
 in their repositories. Others may work but may require more effort. Older
 versions such as Centos 6 don't work due to missing features. 
-64 bit Windows OS is supported with mingw_w64 and msys or pre-built binaries.
 
+Windows 7 or newer is supported with mingw_w64 and msys or using the pre-built
+binaries. WindowsXP 64 bit is YMMV.
+
+FreeBSD is not actively tested but should work, YMMV.
 MacOS, OSx and Android are not supported.
 
 3. Stratum pool supporting stratum+tcp:// or stratum+ssl:// protocols or

diff --git a/README.txt b/README.txt
@@ -1,12 +1,22 @@
 This file is included in the Windows binary package. Compile instructions
 for Linux and Windows can be found in RELEASE_NOTES.
 
-This package is officially avalable only from:
+cpuminer-opt is open source and free of any fees. Many forks exist that are
+closed source and contain usage fees. support open source free software.
+
+This package is officially avalaible only from:
+
  https://github.com/JayDDee/cpuminer-opt
+
 No other sources should be trusted.
 
 cpuminer is a console program that is executed from a DOS or Powershell
-prompt. There is no GUI and no mouse support.
+command prompt. There is no GUI and no mouse support.
+
+New users are encouraged to consult the cpuminer-opt Wiki for detailed
+information on usage:
+
+https://github.com/JayDDee/cpuminer-opt/wiki
 
 Miner programs are often flagged as malware by antivirus programs. This is
 a false positive, they are flagged simply because they are cryptocurrency 
@@ -43,12 +53,11 @@ cpuminer-avx2.exe              Haswell, Skylake, Kabylake, Coffeelake, Cometlake
 cpuminer-avx2-sha.exe          AMD Zen1, Zen2
 cpuminer-avx2-sha-vaes.exe     Intel Alderlake*, AMD Zen3
 cpuminer-avx512.exe            Intel HEDT Skylake-X, Cascadelake
-cpuminer-avx512-sha-vaes.exe   Icelake, Tigerlake, Rocketlake
+cpuminer-avx512-sha-vaes.exe   AMD Zen4, Intel Rocketlake, Icelake
 
-* Alderlake is a hybrid architecture. With the E-cores disabled it may be
-  possible to enable AVX512 on the the P-cores and use the avx512-sha-vaes
-  build. This is not officially supported by Intel at time of writing.
-  Check for current information.
+* Alderlake is a hybrid architecture with a mix of E-cores & P-cores. Although
+  the P-cores can support AVX512 the E-cores can't so Intel decided to disable
+  AVX512 on the the P-cores.
 
 Notes about included DLL files:
 
@@ -59,9 +68,11 @@ source code obtained from the author's official repository. The exact
 procedure is documented in the build instructions for Windows:
 https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source
 
-Some DLL filess may already be installed on the system by Windows or third
-party packages. They often will work and may be used instead of the included
-file. 
+Some included DLL files may already be installed on the system by Windows or
+third party packages. They often will work and may be used instead of the
+included version of the files.
+
+
 
 If you like this software feel free to donate:
 

diff --git a/RELEASE_NOTES b/RELEASE_NOTES
@@ -65,6 +65,12 @@ If not what makes it happen or not happen?
 Change Log
 ----------
 
+v3.20.3
+
+Faster c11 algo: AVX512 6%, AVX2 4%, AVX2+VAES 15%.
+Faster AVX2+VAES for anime 14%, hmq1725 6%.
+Small optimizations to Luffa AVX2 & AVX512.
+
 v3.20.2
 
 Bit rotation optimizations to Blake256, Blake512, Blake2b, Blake2s & Lyra2-blake2b for SSE2 & AVX2.
@@ -75,7 +81,7 @@ v3.20.1
 sph_blake2b optimized 1-way SSSE3 & AVX2.
 Removed duplicate Blake2b used by Power2b algo, will now use optimized sph_blake2b.
 Removed imprecise hash & target display from rejected share log.
-Share and target difficulty is now displayed only for low diificulty shares.
+Share and target difficulty is now displayed only for low difficulty shares.
 Updated configure.ac to check for AVX512 asm support.
 Small optimization to Lyra2 SSE2.
 

diff --git a/algo-gate-api.c b/algo-gate-api.c
@@ -67,7 +67,6 @@ void do_nothing   () {}
 bool return_true  () { return true;  }
 bool return_false () { return false; }
 void *return_null () { return NULL;  }
-void call_error   () { printf("ERR: Uninitialized function pointer\n"); }
 
 void algo_not_tested()
 {
@@ -95,7 +94,8 @@ int null_scanhash()
    return 0;
 }
 
-// Default generic scanhash can be used in many cases.
+// Default generic scanhash can be used in many cases. Not to be used when
+// prehashing can be done or when byte swapping the data can be avoided.
 int scanhash_generic( struct work *work, uint32_t max_nonce,
                       uint64_t *hashes_done, struct thr_info *mythr )
 {
@@ -152,6 +152,9 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
    const bool bench = opt_benchmark;
 
    mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   // overwrite byte swapped nonce with original byte order for proper
+   // incrementing. The nonce only needs to byte swapped if it is to be
+   // sumbitted.
    *noncev = mm256_intrlv_blend_32(
                    _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
    do
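The comment added in this hunk says the nonce lane is restored to its original byte order so that it increments properly. A minimal scalar sketch of why that matters follows; the helper names are hypothetical and are not part of cpuminer-opt. Adding 1 to a byte-swapped counter bumps what is really the most significant byte of the original value, so the nonce sequence would no longer be n, n+1, n+2, ...

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit word. */
static uint32_t bswap32( uint32_t x )
{
   return ( x >> 24 ) | ( ( x >> 8 ) & 0x0000ff00u )
        | ( ( x << 8 ) & 0x00ff0000u ) | ( x << 24 );
}

/* What the next nonce would be, seen in original byte order, if the
   increment were applied to the byte-swapped representation instead. */
static uint32_t next_nonce_if_swapped( uint32_t n )
{
   return bswap32( bswap32( n ) + 1 );
}

/* The intended behaviour: increment in original byte order. */
static uint32_t next_nonce( uint32_t n )
{
   return n + 1;
}
```

For example, starting from `0x000000ff` the intended successor is `0x00000100`, while incrementing the swapped form yields `0x010000ff`.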

diff --git a/algo/blake/blake256-hash-4way.c b/algo/blake/blake256-hash-4way.c
@@ -316,15 +316,15 @@ static const sph_u32 CS[16] = {
                                           CSx( r, 5 ) ^ Mx( r, 4 ), \
                                           CSx( r, 3 ) ^ Mx( r, 2 ), \
                                           CSx( r, 1 ) ^ Mx( r, 0 ) ) ) ); \
-   V3 = mm128_ror_32( _mm_xor_si128( V3, V0 ), 16 ); \
+   V3 = mm128_swap32_16( _mm_xor_si128( V3, V0 ) ); \
    V2 = _mm_add_epi32( V2, V3 ); \
    V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 12 ); \
    V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
                            _mm_set_epi32( CSx( r, 6 ) ^ Mx( r, 7 ), \
                                           CSx( r, 4 ) ^ Mx( r, 5 ), \
                                           CSx( r, 2 ) ^ Mx( r, 3 ), \
                                           CSx( r, 0 ) ^ Mx( r, 1 ) ) ) ); \
-   V3 = mm128_ror_32( _mm_xor_si128( V3, V0 ), 8 ); \
+   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
    V2 = _mm_add_epi32( V2, V3 ); \
    V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
    V3 = mm128_shufll_32( V3 ); \
@@ -335,15 +335,15 @@ static const sph_u32 CS[16] = {
                                           CSx( r, D ) ^ Mx( r, C ), \
                                           CSx( r, B ) ^ Mx( r, A ), \
                                           CSx( r, 9 ) ^ Mx( r, 8 ) ) ) ); \
-   V3 = mm128_ror_32( _mm_xor_si128( V3, V0 ), 16 ); \
+   V3 = mm128_swap32_16( _mm_xor_si128( V3, V0 ) ); \
    V2 = _mm_add_epi32( V2, V3 ); \
    V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 12 ); \
    V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
                            _mm_set_epi32( CSx( r, E ) ^ Mx( r, F ), \
                                           CSx( r, C ) ^ Mx( r, D ), \
                                           CSx( r, A ) ^ Mx( r, B ), \
                                           CSx( r, 8 ) ^ Mx( r, 9 ) ) ) ); \
-   V3 = mm128_ror_32( _mm_xor_si128( V3, V0 ), 8 ); \
+   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
    V2 = _mm_add_epi32( V2, V3 ); \
    V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
    V3 = mm128_shuflr_32( V3 ); \
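The hunks above replace `mm128_ror_32( x, 16 )` and `mm128_ror_32( x, 8 )` with `mm128_swap32_16( x )` and `mm128_shuflr32_8( x )`. Their definitions are not shown in this diff; judging only by the names, they presumably perform the same rotations as shuffles. The scalar sketch below, using hypothetical helper names, illustrates why such a substitution preserves the result: rotating a 32-bit word by 16 swaps its 16-bit halves, and rotating right by 8 is a fixed permutation of its bytes.

```c
#include <stdint.h>

/* Reference rotate right, valid for 0 < c < 32. */
static uint32_t rotr32( uint32_t x, unsigned c )
{
   return ( x >> c ) | ( x << ( 32 - c ) );
}

/* Hypothetical scalar analogue of a 16-bit half swap. */
static uint32_t swap_halves32( uint32_t x )
{
   const uint32_t lo = x & 0xffffu;
   const uint32_t hi = x >> 16;
   return ( lo << 16 ) | hi;
}

/* Hypothetical scalar analogue of a byte shuffle: every byte moves down
   one position and the least significant byte wraps to the top. */
static uint32_t shuffle_bytes_r8( uint32_t x )
{
   const uint32_t b0 =   x         & 0xffu;
   const uint32_t b1 = ( x >>  8 ) & 0xffu;
   const uint32_t b2 = ( x >> 16 ) & 0xffu;
   const uint32_t b3 = ( x >> 24 ) & 0xffu;
   return b1 | ( b2 << 8 ) | ( b3 << 16 ) | ( b0 << 24 );
}
```

With `x = 0x12345678`, both `rotr32( x, 16 )` and `swap_halves32( x )` give `0x56781234`, and both `rotr32( x, 8 )` and `shuffle_bytes_r8( x )` give `0x78123456`.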

diff --git a/algo/blake/sph_blake2b.c b/algo/blake/sph_blake2b.c
@@ -78,7 +78,8 @@
   V[1] = mm256_shufll_64( V[1] ); \
 }
 
-#elif defined(__SSSE3__)
+#elif defined(__SSE2__)
+// always true
 
 #define BLAKE2B_G( Va, Vb, Vc, Vd, Sa, Sb, Sc, Sd ) \
 { \
@@ -115,6 +116,7 @@
 }
 
 #else
+// never used, SSE2 is always available
 
 #ifndef ROTR64
 #define ROTR64(x, y)  (((x) >> (y)) ^ ((x) << (64 - (y))))
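The fallback `ROTR64` macro above combines the two shifted halves with XOR rather than the more common OR. A small sketch, with hypothetical helper names, of why both forms agree: for 0 < y < 64 the two shifted values have no set bits in common, so XOR and OR produce the same word. A count of 0 is excluded because it would shift by 64, which is undefined in C.

```c
#include <stdint.h>

/* Macro as it appears in the diff. */
#define ROTR64(x, y)  (((x) >> (y)) ^ ((x) << (64 - (y))))

/* Conventional OR form, for comparison only. */
static uint64_t rotr64_or( uint64_t x, unsigned y )
{
   return ( x >> y ) | ( x << ( 64 - y ) );
}

/* Returns 1 if both forms agree for every rotate count from 1 to 63. */
static int rotr64_forms_agree( uint64_t x )
{
   for ( unsigned y = 1; y < 64; y++ )
      if ( ROTR64( x, y ) != rotr64_or( x, y ) )
         return 0;
   return 1;
}
```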

diff --git a/algo/bmw/bmw512-hash-4way.c b/algo/bmw/bmw512-hash-4way.c
@@ -747,38 +747,40 @@ void compress_big( const __m256i *M, const __m256i H[16], __m256i dH[16] )
    mj[14] = mm256_rol_64( M[14], 15 );
    mj[15] = mm256_rol_64( M[15], 16 );
 
-   qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7],
-              (const __m256i)_mm256_set1_epi64x( 16 * 0x0555555555555555ULL ) );
-   qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8],
-              (const __m256i)_mm256_set1_epi64x( 17 * 0x0555555555555555ULL ) );
-   qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9],
-              (const __m256i)_mm256_set1_epi64x( 18 * 0x0555555555555555ULL ) );
-   qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10],
-              (const __m256i)_mm256_set1_epi64x( 19 * 0x0555555555555555ULL ) );
-   qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11],
-              (const __m256i)_mm256_set1_epi64x( 20 * 0x0555555555555555ULL ) );
-   qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12],
-              (const __m256i)_mm256_set1_epi64x( 21 * 0x0555555555555555ULL ) );
-   qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13],
-              (const __m256i)_mm256_set1_epi64x( 22 * 0x0555555555555555ULL ) );
-   qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14],
-              (const __m256i)_mm256_set1_epi64x( 23 * 0x0555555555555555ULL ) );
-   qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15],
-              (const __m256i)_mm256_set1_epi64x( 24 * 0x0555555555555555ULL ) );
-   qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0],
-              (const __m256i)_mm256_set1_epi64x( 25 * 0x0555555555555555ULL ) );
-   qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1],
-              (const __m256i)_mm256_set1_epi64x( 26 * 0x0555555555555555ULL ) );
-   qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2],
-              (const __m256i)_mm256_set1_epi64x( 27 * 0x0555555555555555ULL ) );
-   qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3],
-              (const __m256i)_mm256_set1_epi64x( 28 * 0x0555555555555555ULL ) );
-   qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4],
-              (const __m256i)_mm256_set1_epi64x( 29 * 0x0555555555555555ULL ) );
-   qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5],
-              (const __m256i)_mm256_set1_epi64x( 30 * 0x0555555555555555ULL ) );
-   qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6],
-              (const __m256i)_mm256_set1_epi64x( 31 * 0x0555555555555555ULL ) );
+   __m256i K = _mm256_set1_epi64x( 16 * 0x0555555555555555ULL );
+   const __m256i Kincr = _mm256_set1_epi64x( 0x0555555555555555ULL );
+
+   qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6], K );
 
    qt[16] = _mm256_add_epi64( qt[16], expand1_b( qt, 16 ) );
    qt[17] = _mm256_add_epi64( qt[17], expand1_b( qt, 17 ) );
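The hunk above replaces sixteen separately broadcast constants `j * 0x0555555555555555ULL` (j = 16..31) with one vector `K` that is stepped by `Kincr` after each use. The scalar sketch below checks that the running sum visits exactly the same values; the macro and function names are hypothetical, and only the constant is taken from the diff.

```c
#include <stdint.h>

/* Constant from the diff; the macro name is an assumption. */
#define BMW_K_STEP 0x0555555555555555ULL

/* Returns 1 if stepping K by addition reproduces j * BMW_K_STEP
   for every j from 16 to 31, as the rewritten code relies on. */
static int k_sequence_matches( void )
{
   uint64_t k = 16 * BMW_K_STEP;
   for ( uint64_t j = 16; j <= 31; j++ )
   {
      if ( k != j * BMW_K_STEP )
         return 0;
      k += BMW_K_STEP;   // scalar counterpart of adding Kincr to K
   }
   return 1;
}
```

Both forms use unsigned 64-bit arithmetic, so they would stay equal even if a product wrapped around; for j up to 31 no wraparound occurs.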
@@ -1180,7 +1182,6 @@ void compress_big_8way( const __m512i *M, const __m512i H[16],
    qt[15] = _mm512_add_epi64( s8b0( W8b15), H[ 0] );
 
    __m512i mj[16];
-   uint64_t K = 16 * 0x0555555555555555ULL;
 
    mj[ 0] = mm512_rol_64( M[ 0],  1 );
    mj[ 1] = mm512_rol_64( M[ 1],  2 );
@@ -1199,54 +1200,40 @@ void compress_big_8way( const __m512i *M, const __m512i H[16],
    mj[14] = mm512_rol_64( M[14], 15 );
    mj[15] = mm512_rol_64( M[15], 16 );
 
-   qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-   K += 0x0555555555555555ULL;
-   qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6],
-                        (const __m512i)_mm512_set1_epi64( K ) );
-
+   __m512i K = _mm512_set1_epi64( 16 * 0x0555555555555555ULL );
+   const __m512i Kincr = _mm512_set1_epi64( 0x0555555555555555ULL );
+
+   qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6], K );
 
    qt[16] = _mm512_add_epi64( qt[16], expand1_b8( qt, 16 ) );
    qt[17] = _mm512_add_epi64( qt[17], expand1_b8( qt, 17 ) );