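SM4 Performance Optimization
I recall coming across an article on a Chinese patent for mapping SM4 onto AES. Was that an independent discovery, or was it plagiarism?
Go's symmetric encryption implementation separates the cipher modes from block-level encryption, and it also lets a specific cipher supply its own optimized version of a mode. So by implementing only single-block SM4 encryption and decryption (the Block interface), we can already use the CBC/CFB/OFB/CTR/GCM modes.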
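As a minimal sketch of that separation: the function below drives CTR and GCM from the standard library over an arbitrary cipher.Block. AES-128 is used as a stand-in so the example runs without extra dependencies; the assumption is that an SM4 Block with the same 16-byte block size could be passed in the same way. The names roundTripModes and demoModes are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
)

// roundTripModes runs CTR and GCM over an arbitrary cipher.Block and reports
// whether both round trips recover the plaintext. The mode code never needs
// to know which block cipher it is driving.
func roundTripModes(block cipher.Block, plaintext []byte) (bool, error) {
	// CTR: the same XORKeyStream call both encrypts and decrypts.
	iv := make([]byte, block.BlockSize()) // all-zero IV, for illustration only
	ct := make([]byte, len(plaintext))
	cipher.NewCTR(block, iv).XORKeyStream(ct, plaintext)
	pt := make([]byte, len(ct))
	cipher.NewCTR(block, iv).XORKeyStream(pt, ct)
	if !bytes.Equal(pt, plaintext) {
		return false, nil
	}

	// GCM: authenticated encryption built on the same Block.
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return false, err
	}
	nonce := make([]byte, aead.NonceSize()) // fixed nonce, for illustration only
	sealed := aead.Seal(nil, nonce, plaintext, nil)
	opened, err := aead.Open(nil, nonce, sealed, nil)
	if err != nil {
		return false, err
	}
	return bytes.Equal(opened, plaintext), nil
}

// demoModes uses AES-128 as a stand-in Block (16-byte key and block size).
func demoModes() bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	ok, err := roundTripModes(block, []byte("any cipher.Block works with the standard modes"))
	return err == nil && ok
}
```

The generic path calls the Block one block at a time, which is why the mode-specific multi-block optimizations discussed later matter.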
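As for optimized SM4 implementations, fine-grained parallelism is unlikely to help much (bitslicing is one direction; see sm4bs). For coarse-grained optimization, see sm4ni; quite a few modes allow multiple blocks to be encrypted or decrypted in parallel.
(The pure Go implementation has also been optimized continuously since then.)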
CPU: i5-9500
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm4
BenchmarkSM4CBCEncrypt1K-6 42994 27766 ns/op 36.88 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCDecrypt1K-6 42690 28103 ns/op 36.44 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBEncrypt1K-6 42945 27759 ns/op 36.71 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt1K-6 42820 28493 ns/op 35.76 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt8K-6 5338 227642 ns/op 35.96 MB/s 0 B/op 0 allocs/op
BenchmarkSM4OFB1K-6 43754 27443 ns/op 37.13 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR1K-6 43292 27392 ns/op 37.20 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR8K-6 5338 220872 ns/op 37.07 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal1K-6 37594 31777 ns/op 32.22 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMOpen1K-6 37029 31919 ns/op 32.08 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSign1K-6 315050 3882 ns/op 263.81 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSign8K-6 43905 26876 ns/op 304.81 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSeal8K-6 4917 250707 ns/op 32.68 MB/s 49 B/op 3 allocs/op
BenchmarkSM4GCMOpen8K-6 4722 248856 ns/op 32.92 MB/s 48 B/op 3 allocs/op
PASS
ok github.com/emmansun/gmsm/sm4 20.818s
CPU: i5-9500
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm4_test
BenchmarkSM4CBCEncrypt1K-6 73611 15995 ns/op 64.02 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCDecrypt1K-6 71901 15751 ns/op 65.01 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBEncrypt1K-6 73622 15952 ns/op 63.88 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt1K-6 75414 15862 ns/op 64.24 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt8K-6 10000 127258 ns/op 64.33 MB/s 0 B/op 0 allocs/op
BenchmarkSM4OFB1K-6 76830 15539 ns/op 65.58 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR1K-6 77738 15404 ns/op 66.15 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR8K-6 10000 123441 ns/op 66.32 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal1K-6 61476 19944 ns/op 51.34 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMOpen1K-6 60858 19689 ns/op 52.01 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSign1K-6 323806 3732 ns/op 274.41 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSign8K-6 44227 27179 ns/op 301.41 MB/s 48 B/op 3 allocs/op
BenchmarkSM4GCMSeal8K-6 7683 153646 ns/op 53.32 MB/s 49 B/op 3 allocs/op
BenchmarkSM4GCMOpen8K-6 7683 153959 ns/op 53.21 MB/s 48 B/op 3 allocs/op
PASS
ok github.com/emmansun/gmsm/sm4_test 18.863s
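If you want to understand more, see SM4 with AESENCLAST. The next step is multi-block parallel optimization, mode by mode.
I was lazy and did not write a separate asm function.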
CPU: i5-9500
BenchmarkSM4CBCDecrypt1K-6 292531 4103 ns/op 249.56 MB/s 0 B/op 0 allocs/op
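CBC decryption lends itself to multi-block processing because each plaintext block is P_i = D(C_i) XOR C_(i-1): every block-cipher call takes only ciphertext as input. CBC encryption, by contrast, needs the previous ciphertext block before it can start the next one. The following is a hedged sketch of that structure in plain Go, not the library's assembly; cbcDecryptBatched is a hypothetical name and AES stands in for SM4.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
)

// cbcDecryptBatched decrypts CBC ciphertext in two passes. Pass 1 runs the
// block cipher over every ciphertext block; no call depends on another, so an
// optimized implementation could handle several blocks per call. Pass 2 XORs
// each result with the previous ciphertext block (the IV for block 0).
// len(ct) must be a multiple of the block size.
func cbcDecryptBatched(block cipher.Block, iv, ct []byte) []byte {
	bs := block.BlockSize()
	pt := make([]byte, len(ct))
	for i := 0; i < len(ct); i += bs { // independent calls
		block.Decrypt(pt[i:i+bs], ct[i:i+bs])
	}
	prev := iv
	for i := 0; i < len(ct); i += bs {
		for j := 0; j < bs; j++ {
			pt[i+j] ^= prev[j]
		}
		prev = ct[i : i+bs]
	}
	return pt
}

// demoCBCDecrypt checks the sketch against the standard library.
func demoCBCDecrypt() bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	iv := bytes.Repeat([]byte{0x42}, 16)
	plain := bytes.Repeat([]byte("0123456789abcdef"), 8)
	ct := make([]byte, len(plain))
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(ct, plain)
	want := make([]byte, len(ct))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(want, ct)
	got := cbcDecryptBatched(block, iv, ct)
	return bytes.Equal(got, want) && bytes.Equal(got, plain)
}
```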
CPU: i5-9500
BenchmarkSM4CTR1K-6 292522 4121 ns/op 247.30 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR8K-6 36483 33203 ns/op 246.57 MB/s 0 B/op 0 allocs/op
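CTR mode parallelizes in both directions because the keystream depends only on the counter values, never on the data. A minimal sketch, assuming a batch of 8 blocks (a hypothetical choice, not necessarily what the assembly uses) and with AES standing in for SM4; ctrXORBatched and incCounter are illustrative names:

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
)

// incCounter increments the counter block as a big-endian integer.
func incCounter(c []byte) {
	for i := len(c) - 1; i >= 0; i-- {
		c[i]++
		if c[i] != 0 {
			return
		}
	}
}

// ctrXORBatched fills a batch of counter blocks, encrypts them (each call is
// independent of the others), and XORs the resulting keystream into src.
func ctrXORBatched(block cipher.Block, iv, src []byte) []byte {
	const batch = 8 // hypothetical batch size
	bs := block.BlockSize()
	counter := append([]byte(nil), iv...)
	ks := make([]byte, batch*bs)
	dst := make([]byte, len(src))
	for off := 0; off < len(src); off += len(ks) {
		for i := 0; i < len(ks); i += bs {
			copy(ks[i:i+bs], counter)
			incCounter(counter)
		}
		for i := 0; i < len(ks); i += bs { // independent calls
			block.Encrypt(ks[i:i+bs], ks[i:i+bs])
		}
		n := len(src) - off
		if n > len(ks) {
			n = len(ks)
		}
		for i := 0; i < n; i++ {
			dst[off+i] = src[off+i] ^ ks[i]
		}
	}
	return dst
}

// demoCTR checks the sketch against cipher.NewCTR, including a counter carry
// and a final partial block.
func demoCTR() bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	iv := bytes.Repeat([]byte{0xff}, 16)
	iv[0], iv[15] = 0x01, 0xfe
	src := bytes.Repeat([]byte{0xa5}, 1000)
	want := make([]byte, len(src))
	cipher.NewCTR(block, iv).XORKeyStream(want, src)
	return bytes.Equal(ctrXORBatched(block, iv, src), want)
}
```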
Here the encryption part is parallelized first; optimizing the GHASH part will have to be done gradually.
CPU: i5-9500
BenchmarkSM4GCMSeal1K-6 153688 7904 ns/op 129.56 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen1K-6 149971 7896 ns/op 129.69 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign1K-6 315027 3753 ns/op 272.85 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen8K-6 19948 60258 ns/op 135.95 MB/s 0 B/op 0 allocs/op
The asm part is adapted from the AES implementation, and the optimization result is striking!
CPU: i5-9500
BenchmarkSM4GCMSeal1K-6 273218 4491 ns/op 228.00 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen1K-6 250770 4516 ns/op 226.73 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign1K-6 3321482 359 ns/op 2853.54 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign8K-6 1000000 1014 ns/op 8079.61 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal8K-6 35432 33863 ns/op 241.92 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen8K-6 35214 33940 ns/op 241.36 MB/s 0 B/op 0 allocs/op
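The two optimization steps above correspond to the two halves of GCM: CTR encryption and the GHASH authenticator. The sketch below assembles GCM from those two parts for a 12-byte nonce and checks the result against cipher.NewGCM. It is a slow, bit-by-bit illustration following NIST SP 800-38D, with AES as a stand-in block cipher; fe, mul, ghash and gcmSealSketch are hypothetical names, not the library's code.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
)

// fe is a 128-bit GHASH field element stored as two big-endian halves.
type fe struct{ hi, lo uint64 }

func load(b []byte) fe {
	return fe{binary.BigEndian.Uint64(b[:8]), binary.BigEndian.Uint64(b[8:16])}
}

// mul multiplies x and y in GF(2^128) with GCM's bit ordering, one bit at a time.
func mul(x, y fe) fe {
	var z fe
	v := y
	for i := 0; i < 128; i++ {
		word, shift := x.hi, uint(63-i)
		if i >= 64 {
			word, shift = x.lo, uint(127-i)
		}
		if word>>shift&1 == 1 {
			z.hi ^= v.hi
			z.lo ^= v.lo
		}
		carry := v.lo & 1
		v.lo = v.lo>>1 | v.hi<<63
		v.hi >>= 1
		if carry == 1 {
			v.hi ^= 0xe1 << 56 // reduction polynomial
		}
	}
	return z
}

// ghash absorbs each chunk in 16-byte blocks, zero-padding the last block of a chunk.
func ghash(h fe, chunks ...[]byte) fe {
	var y fe
	for _, data := range chunks {
		for len(data) > 0 {
			var blk [16]byte
			n := copy(blk[:], data)
			data = data[n:]
			x := load(blk[:])
			y = mul(fe{y.hi ^ x.hi, y.lo ^ x.lo}, h) // serial chain over blocks
		}
	}
	return y
}

// gcmSealSketch returns ciphertext||tag for a 12-byte nonce.
func gcmSealSketch(block cipher.Block, nonce, plaintext, aad []byte) []byte {
	var zero, hb, j0, lens, tag [16]byte
	block.Encrypt(hb[:], zero[:]) // hash key H = E(0)
	copy(j0[:], nonce)
	j0[15] = 1
	// Part 1: CTR encryption starting at J0+1 (the part that parallelizes easily).
	ctr := j0
	ctr[15] = 2
	out := make([]byte, len(plaintext), len(plaintext)+16)
	cipher.NewCTR(block, ctr[:]).XORKeyStream(out, plaintext)
	// Part 2: GHASH over AAD, ciphertext and their bit lengths.
	binary.BigEndian.PutUint64(lens[:8], uint64(len(aad))*8)
	binary.BigEndian.PutUint64(lens[8:], uint64(len(out))*8)
	s := ghash(load(hb[:]), aad, out, lens[:])
	block.Encrypt(tag[:], j0[:])
	binary.BigEndian.PutUint64(tag[:8], binary.BigEndian.Uint64(tag[:8])^s.hi)
	binary.BigEndian.PutUint64(tag[8:], binary.BigEndian.Uint64(tag[8:])^s.lo)
	return append(out, tag[:]...)
}

// demoGCM compares the sketch with the standard library's GCM.
func demoGCM() bool {
	block, err := aes.NewCipher(bytes.Repeat([]byte{7}, 16))
	if err != nil {
		return false
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return false
	}
	nonce := bytes.Repeat([]byte{3}, 12)
	pt, aad := bytes.Repeat([]byte{0x5a}, 100), []byte("hdr42")
	return bytes.Equal(gcmSealSketch(block, nonce, pt, aad), aead.Seal(nil, nonce, pt, aad))
}
```

Reusing cipher.NewCTR here is safe only because the low 32 bits of the counter do not wrap for such a short message; GCM itself increments only those 32 bits.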
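CCM computes its MAC with CBC-mode encryption (CBC-MAC). Given the serial nature of CBC mode and the cost of each block encryption, its performance is bound to be worse than that of GCM. A related discussion: proposal: crypto/tls: add support for AES-CCM #27484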
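To make the serial dependency concrete, here is a minimal sketch of a raw CBC-MAC loop. It is not CCM itself (CCM adds its own formatting of nonce, lengths and associated data before the MAC, and a raw CBC-MAC is not safe for variable-length messages); cbcMAC is a hypothetical name and AES stands in for SM4.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
)

// cbcMAC computes a raw CBC-MAC with a zero IV. len(msg) must be a multiple
// of the block size. Each Encrypt call consumes the output of the previous
// one, so the block-cipher calls cannot be batched.
func cbcMAC(block cipher.Block, msg []byte) []byte {
	bs := block.BlockSize()
	mac := make([]byte, bs)
	for i := 0; i < len(msg); i += bs {
		for j := 0; j < bs; j++ {
			mac[j] ^= msg[i+j]
		}
		block.Encrypt(mac, mac) // depends on the previous iteration
	}
	return mac
}

// demoCBCMAC checks that the MAC equals the last block of a CBC encryption
// with a zero IV.
func demoCBCMAC() bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	msg := bytes.Repeat([]byte("0123456789abcdef"), 4)
	ct := make([]byte, len(msg))
	cipher.NewCBCEncrypter(block, make([]byte, 16)).CryptBlocks(ct, msg)
	return bytes.Equal(cbcMAC(block, msg), ct[len(ct)-16:])
}
```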
CPU: i5-9500
goos: windows
goarch: amd64
BenchmarkAESCCMSign1K-6 1000000 1242 ns/op 824.78 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMSign1K-6 81297 14768 ns/op 69.34 MB/s 688 B/op 6 allocs/op
BenchmarkAESCCMSeal1K-6 522488 2362 ns/op 433.44 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMSeal1K-6 61077 19103 ns/op 53.61 MB/s 688 B/op 6 allocs/op
BenchmarkAESCCMOpen1K-6 546088 2330 ns/op 439.54 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMOpen1K-6 64393 19265 ns/op 53.15 MB/s 688 B/op 6 allocs/op
BenchmarkAESCCMSign8K-6 160434 7566 ns/op 1082.71 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMSign8K-6 10000 114325 ns/op 71.66 MB/s 688 B/op 6 allocs/op
BenchmarkAESCCMSeal8K-6 74269 15840 ns/op 517.17 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMSeal8K-6 8020 143910 ns/op 56.92 MB/s 689 B/op 6 allocs/op
BenchmarkAESCCMOpen8K-6 77215 15650 ns/op 523.44 MB/s 656 B/op 6 allocs/op
BenchmarkSM4CCMOpen8K-6 8575 143358 ns/op 57.14 MB/s 688 B/op 6 allocs/op
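As the numbers show, on AMD64 sm4-ccm runs at roughly 1/5 the speed of sm4-gcm.
Golang provides no optimization interface for these two modes (CFB and OFB), perhaps because they are no longer much recommended; besides, only CFB decryption can be parallelized.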
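The reason only CFB decryption parallelizes is that it computes P_i = C_i XOR E(C_(i-1)), where every block-cipher input is already-known ciphertext. CFB encryption must wait for the previous ciphertext block, and OFB chains the keystream itself in both directions. A hedged sketch of full-block CFB with hypothetical function names, AES standing in for SM4:

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
)

// cfbEncryptSerial is full-block CFB encryption: each keystream block is the
// encryption of the previous ciphertext block, so the loop is inherently serial.
func cfbEncryptSerial(block cipher.Block, iv, pt []byte) []byte {
	bs := block.BlockSize()
	ct := make([]byte, len(pt))
	ks := make([]byte, bs)
	prev := iv
	for i := 0; i < len(pt); i += bs {
		block.Encrypt(ks, prev)
		end := i + bs
		if end > len(pt) {
			end = len(pt)
		}
		for j := i; j < end; j++ {
			ct[j] = pt[j] ^ ks[j-i]
		}
		prev = ct[i:end]
	}
	return ct
}

// cfbDecryptBatched lays out all block-cipher inputs (IV, C_0, C_1, ...)
// up front, encrypts them independently, then XORs with the ciphertext.
func cfbDecryptBatched(block cipher.Block, iv, ct []byte) []byte {
	if len(ct) == 0 {
		return nil
	}
	bs := block.BlockSize()
	nblocks := (len(ct) + bs - 1) / bs
	ks := make([]byte, nblocks*bs)
	copy(ks, iv)
	copy(ks[bs:], ct) // the last ciphertext block is never an input
	for i := 0; i < len(ks); i += bs { // independent calls
		block.Encrypt(ks[i:i+bs], ks[i:i+bs])
	}
	pt := make([]byte, len(ct))
	for i := range ct {
		pt[i] = ct[i] ^ ks[i]
	}
	return pt
}

// demoCFB round-trips a message whose length is not a block multiple.
func demoCFB() bool {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return false
	}
	iv := bytes.Repeat([]byte{0x11}, 16)
	plain := bytes.Repeat([]byte("cfb-demo"), 13) // 104 bytes
	ct := cfbEncryptSerial(block, iv, plain)
	return !bytes.Equal(ct, plain) && bytes.Equal(cfbDecryptBatched(block, iv, ct), plain)
}
```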
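XTS mode is mainly used for disk encryption, but SM4 is hardly ever used directly as the disk encryption algorithm; at most it serves as a CMK that encrypts and decrypts data keys.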
- The XTS-AES Tweakable Block Cipher
- AES-XTS Block Cipher Mode is used in Kingston's Encrypted USB Flash Drives
Before parallelizing sm4-xts:
CPU: i5-9500
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm4_test
BenchmarkAES128XTSEncrypt512-6 1000000 1166 ns/op 439.08 MB/s 0 B/op 0 allocs/op
BenchmarkAES128XTSEncrypt1K-6 572972 2141 ns/op 478.18 MB/s 0 B/op 0 allocs/op
BenchmarkAES128XTSEncrypt4K-6 132927 9028 ns/op 453.71 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt512-6 1000000 1190 ns/op 430.24 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt1K-6 522496 2428 ns/op 421.79 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt4K-6 129376 9233 ns/op 443.63 MB/s 0 B/op 0 allocs/op
BenchmarkSM4XTSEncrypt512-6 160358 7594 ns/op 67.42 MB/s 0 B/op 0 allocs/op
BenchmarkSM4XTSEncrypt1K-6 80736 14952 ns/op 68.49 MB/s 0 B/op 0 allocs/op
BenchmarkSM4XTSEncrypt4K-6 20324 59351 ns/op 69.01 MB/s 0 B/op 0 allocs/op
After parallelizing sm4-xts:
CPU: i5-9500
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm4_test
BenchmarkAES128XTSEncrypt512-6 1000000 1065 ns/op 480.82 MB/s 0 B/op 0 allocs/op
BenchmarkAES128XTSEncrypt1K-6 572985 2102 ns/op 487.20 MB/s 0 B/op 0 allocs/op
BenchmarkAES128XTSEncrypt4K-6 145036 8441 ns/op 485.26 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt512-6 925447 1225 ns/op 417.81 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt1K-6 500738 2465 ns/op 415.35 MB/s 0 B/op 0 allocs/op
BenchmarkAES256XTSEncrypt4K-6 127996 9408 ns/op 435.38 MB/s 0 B/op 0 allocs/op
BenchmarkSM4XTSEncrypt512-6 445254 2910 ns/op 175.93 MB/s 64 B/op 1 allocs/op
BenchmarkSM4XTSEncrypt1K-6 218763 5382 ns/op 190.28 MB/s 64 B/op 1 allocs/op
BenchmarkSM4XTSEncrypt4K-6 59904 19820 ns/op 206.66 MB/s 64 B/op 1 allocs/op
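XTS can be processed several blocks at a time because the only chained quantity is the tweak, and successive tweaks are obtained by a cheap doubling in GF(2^128). Once the tweaks for a batch are known, each block is C_j = E(P_j XOR T_j) XOR T_j with no dependency on its neighbours. The sketch below illustrates this under stated assumptions: no ciphertext stealing (input length is a multiple of 16), AES as a stand-in for SM4, and hypothetical names mul2 and xtsCrypt; it is not the library's implementation.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
)

// mul2 doubles the tweak in GF(2^128), little-endian byte order as in XTS.
func mul2(t *[16]byte) {
	var carryIn byte
	for j := range t {
		carryOut := t[j] >> 7
		t[j] = t[j]<<1 | carryIn
		carryIn = carryOut
	}
	if carryIn != 0 {
		t[0] ^= 0x87
	}
}

// xtsCrypt encrypts or decrypts one sector. k1 processes the data, k2
// encrypts the sector number to form the first tweak.
func xtsCrypt(k1, k2 cipher.Block, sector uint64, src []byte, decrypt bool) []byte {
	var t [16]byte
	binary.LittleEndian.PutUint64(t[:8], sector)
	k2.Encrypt(t[:], t[:])

	// Pass 1: derive every tweak. Serial, but only shifts and XORs.
	tweaks := make([][16]byte, len(src)/16)
	for i := range tweaks {
		tweaks[i] = t
		mul2(&t)
	}

	// Pass 2: the block-cipher calls are independent of one another.
	dst := make([]byte, len(src))
	for i, tw := range tweaks {
		b := dst[i*16 : (i+1)*16]
		for j := range b {
			b[j] = src[i*16+j] ^ tw[j]
		}
		if decrypt {
			k1.Decrypt(b, b)
		} else {
			k1.Encrypt(b, b)
		}
		for j := range b {
			b[j] ^= tw[j]
		}
	}
	return dst
}

// demoXTS checks the tweak doubling with a carry and a round trip.
func demoXTS() bool {
	k1, err1 := aes.NewCipher(bytes.Repeat([]byte{1}, 16))
	k2, err2 := aes.NewCipher(bytes.Repeat([]byte{2}, 16))
	if err1 != nil || err2 != nil {
		return false
	}
	plain := bytes.Repeat([]byte("0123456789abcdef"), 32)
	ct := xtsCrypt(k1, k2, 7, plain, false)
	back := xtsCrypt(k1, k2, 7, ct, true)
	t := [16]byte{15: 0x80}
	mul2(&t)
	return bytes.Equal(back, plain) && !bytes.Equal(ct, plain) && t == [16]byte{0: 0x87}
}
```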
Compared with AES, the gap is still considerable. The largest is in GCM seal/open, at over twenty times (on AMD64 with AVX2, a little over ten times).
CPU: i5-8265U
goos: windows
goarch: amd64
pkg: github.com/emmansun/gmsm/sm4_test
BenchmarkAESCBCEncrypt1K-8 914280 1279 ns/op 800.49 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCEncrypt1K-8 59565 20344 ns/op 50.34 MB/s 0 B/op 0 allocs/op
BenchmarkAESCBCDecrypt1K-8 798015 1671 ns/op 612.98 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCDecrypt1K-8 194020 5739 ns/op 178.44 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBEncrypt1K-8 601120 2085 ns/op 488.83 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBEncrypt1K-8 67990 21284 ns/op 47.88 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBDecrypt1K-8 750609 1774 ns/op 574.31 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt1K-8 58508 17990 ns/op 56.64 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBDecrypt8K-8 82408 14005 ns/op 584.56 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt8K-8 8594 141197 ns/op 57.98 MB/s 0 B/op 0 allocs/op
BenchmarkAESOFB1K-8 1000000 1222 ns/op 833.65 MB/s 0 B/op 0 allocs/op
BenchmarkSM4OFB1K-8 46080 22127 ns/op 46.05 MB/s 0 B/op 0 allocs/op
BenchmarkAESCTR1K-8 801361 1373 ns/op 741.94 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR1K-8 226635 6007 ns/op 169.65 MB/s 0 B/op 0 allocs/op
BenchmarkAESCTR8K-8 109918 11466 ns/op 714.01 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR8K-8 28767 48448 ns/op 168.99 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSeal1K-8 4178898 308 ns/op 3329.79 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal1K-8 236019 5334 ns/op 191.99 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMOpen1K-8 4608244 313 ns/op 3272.58 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen1K-8 231393 8268 ns/op 123.85 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSign1K-8 7460964 182 ns/op 5619.56 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign1K-8 2458273 429 ns/op 2384.52 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSign8K-8 1000000 1066 ns/op 7681.92 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign8K-8 998692 1624 ns/op 5043.29 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSeal8K-8 639726 1707 ns/op 4798.52 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal8K-8 27079 48554 ns/op 168.72 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMOpen8K-8 668884 2139 ns/op 3829.27 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen8K-8 26265 53390 ns/op 153.44 MB/s 0 B/op 0 allocs/op
PASS
ok github.com/emmansun/gmsm/sm4_test 47.862s
CPU: i5-9500
goos: windows
goarch: amd64
BenchmarkAESCBCEncrypt1K-6 1000000 1006 ns/op 1017.63 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCEncrypt1K-6 87804 13595 ns/op 75.32 MB/s 0 B/op 0 allocs/op
BenchmarkAESCBCDecrypt1K-6 1240671 964 ns/op 1061.74 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CBCDecrypt1K-6 300069 4037 ns/op 253.68 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBEncrypt1K-6 876500 1425 ns/op 714.92 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBEncrypt1K-6 86581 13843 ns/op 73.61 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBDecrypt1K-6 878245 1338 ns/op 761.56 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt1K-6 86564 13823 ns/op 73.72 MB/s 0 B/op 0 allocs/op
BenchmarkAESCFBDecrypt8K-6 112794 10522 ns/op 778.09 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CFBDecrypt8K-6 10000 110776 ns/op 73.91 MB/s 0 B/op 0 allocs/op
BenchmarkAESOFB1K-6 1343679 892 ns/op 1142.41 MB/s 0 B/op 0 allocs/op
BenchmarkSM4OFB1K-6 89094 13409 ns/op 76.00 MB/s 0 B/op 0 allocs/op
BenchmarkAESCTR1K-6 1000000 1036 ns/op 984.00 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR1K-6 292957 4098 ns/op 248.66 MB/s 0 B/op 0 allocs/op
BenchmarkAESCTR8K-6 149863 8200 ns/op 998.46 MB/s 0 B/op 0 allocs/op
BenchmarkSM4CTR8K-6 36595 32699 ns/op 250.38 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSeal1K-6 4802740 249 ns/op 4113.39 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal1K-6 267092 4385 ns/op 233.52 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMOpen1K-6 5665056 212 ns/op 4836.19 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen1K-6 273200 4380 ns/op 233.80 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSign1K-6 9603033 124 ns/op 8258.43 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign1K-6 3725722 322 ns/op 3183.11 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSign8K-6 1570182 764 ns/op 10723.55 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSign8K-6 1244473 964 ns/op 8498.90 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMSeal8K-6 768501 1619 ns/op 5058.99 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMSeal8K-6 36162 33197 ns/op 246.77 MB/s 0 B/op 0 allocs/op
BenchmarkAESGCMOpen8K-6 944479 1325 ns/op 6183.50 MB/s 0 B/op 0 allocs/op
BenchmarkSM4GCMOpen8K-6 36162 33197 ns/op 246.77 MB/s 0 B/op 0 allocs/op
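As a quick check of the "over twenty times" figure, the ratios can be computed directly from the MB/s columns of the i5-9500 run above. The snippet only restates those benchmark numbers; the type and function names are illustrative.

```go
package main

import "fmt"

// row holds the throughput (MB/s) of one AES/SM4 benchmark pair.
type row struct {
	name     string
	aes, sm4 float64
}

// gcmRows copies the GCM seal/open figures from the i5-9500 run above.
var gcmRows = []row{
	{"GCMSeal1K", 4113.39, 233.52},
	{"GCMOpen1K", 4836.19, 233.80},
	{"GCMSeal8K", 5058.99, 246.77},
	{"GCMOpen8K", 6183.50, 246.77},
}

// speedRatios returns AES throughput divided by SM4 throughput for each row.
func speedRatios(rows []row) []float64 {
	out := make([]float64, len(rows))
	for i, r := range rows {
		out[i] = r.aes / r.sm4
	}
	return out
}

// report formats the ratios, one benchmark per line.
func report(rows []row) string {
	s := ""
	for i, ratio := range speedRatios(rows) {
		s += fmt.Sprintf("%s: AES is %.1fx faster\n", rows[i].name, ratio)
	}
	return s
}
```

On these figures the ratios fall between roughly 17x and 25x.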
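Here is the AES-128 encryption code (amd64). Each round takes a single instruction, and each round processes the full 128 bits (few rounds, good parallelism), so this performance gap is not surprising.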
// func encryptBlockAsm(nr int, xk *uint32, dst, src *byte)
TEXT ·encryptBlockAsm(SB),NOSPLIT,$0
	MOVQ xk+8(FP), AX    // AX = expanded key schedule
	MOVQ dst+16(FP), DX  // DX = output block
	MOVQ src+24(FP), BX  // BX = input block
	MOVUPS 0(AX), X1     // round key 0
	MOVUPS 0(BX), X0     // load the 128-bit input block
	ADDQ $16, AX
	PXOR X1, X0          // initial AddRoundKey
	MOVUPS 0(AX), X1     // rounds 1 to 9: one AESENC per round
	AESENC X1, X0
	MOVUPS 16(AX), X1
	AESENC X1, X0
	MOVUPS 32(AX), X1
	AESENC X1, X0
	MOVUPS 48(AX), X1
	AESENC X1, X0
	MOVUPS 64(AX), X1
	AESENC X1, X0
	MOVUPS 80(AX), X1
	AESENC X1, X0
	MOVUPS 96(AX), X1
	AESENC X1, X0
	MOVUPS 112(AX), X1
	AESENC X1, X0
	MOVUPS 128(AX), X1
	AESENC X1, X0
	MOVUPS 144(AX), X1   // round 10: final round
	AESENCLAST X1, X0
	MOVUPS X0, 0(DX)     // store the encrypted block
	RET
done in v0.9.1
Moved here