Armv8.2 SM3 and SM4
Sun Yimin edited this page Oct 10, 2023 · 31 revisions
go test -v -short -bench . -run=^$ ./...
goos: linux
goarch: arm64
pkg: github.com/emmansun/gmsm/sm3
BenchmarkHash8Bytes
BenchmarkHash8Bytes-2 2738724 438.4 ns/op 18.25 MB/s
BenchmarkHash1K
BenchmarkHash1K-2 192519 6232 ns/op 164.32 MB/s
BenchmarkHash8K
BenchmarkHash8K-2 24950 48112 ns/op 170.27 MB/s
BenchmarkHash8K_SH256
BenchmarkHash8K_SH256-2 223354 5369 ns/op 1525.81 MB/s
PASS
ok github.com/emmansun/gmsm/sm3 5.857s
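The gap to a CPU-instruction-level implementation is roughly 10x: SM3 reaches 170.27 MB/s on 8K input, while BenchmarkHash8K_SH256 reaches 1525.81 MB/s.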
The AESE instruction is equivalent to:
- AddRoundKey(State, RoundKey)
- ShiftRows(State)
- SubBytes(State)
So if RoundKey = 0, AESE effectively performs only:
- ShiftRows(State)
- SubBytes(State)
Does using an all-zero RoundKey have any side effects?
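A small software model can make the zero-RoundKey observation concrete. The sketch below is not the project's code: the names `gmul`, `aesSbox`, `aese`, `shiftRows` and `invShiftRows` are hypothetical, and the state is assumed to be stored column by column as in the AES specification. It checks that with an all-zero key the AddRoundKey step drops out, that SubBytes and ShiftRows commute, and that the remaining ShiftRows can be cancelled by applying the inverse byte permutation to the input first.

```go
package main

import (
	"fmt"
	"math/bits"
)

// gmul multiplies two elements of GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1.
func gmul(a, b byte) byte {
	var p byte
	for i := 0; i < 8; i++ {
		if b&1 != 0 {
			p ^= a
		}
		hi := a & 0x80
		a <<= 1
		if hi != 0 {
			a ^= 0x1b
		}
		b >>= 1
	}
	return p
}

// aesSbox computes one AES S-box entry: field inverse, then the affine map.
func aesSbox(x byte) byte {
	var inv byte // the inverse of 0 is defined as 0
	for c := 1; c < 256 && x != 0; c++ {
		if gmul(x, byte(c)) == 1 {
			inv = byte(c)
			break
		}
	}
	return inv ^ bits.RotateLeft8(inv, 1) ^ bits.RotateLeft8(inv, 2) ^
		bits.RotateLeft8(inv, 3) ^ bits.RotateLeft8(inv, 4) ^ 0x63
}

// subBytes applies the S-box to every byte of the state.
func subBytes(s [16]byte) [16]byte {
	for i := range s {
		s[i] = aesSbox(s[i])
	}
	return s
}

// shiftRows rotates row r of the column-major state left by r positions.
func shiftRows(s [16]byte) [16]byte {
	var out [16]byte
	for col := 0; col < 4; col++ {
		for row := 0; row < 4; row++ {
			out[row+4*col] = s[row+4*((col+row)%4)]
		}
	}
	return out
}

// invShiftRows undoes shiftRows.
func invShiftRows(s [16]byte) [16]byte {
	var out [16]byte
	for col := 0; col < 4; col++ {
		for row := 0; row < 4; row++ {
			out[row+4*col] = s[row+4*((col-row+4)%4)]
		}
	}
	return out
}

// aese models the instruction as AddRoundKey followed by SubBytes and ShiftRows.
func aese(state, key [16]byte) [16]byte {
	for i := range state {
		state[i] ^= key[i]
	}
	return shiftRows(subBytes(state))
}

func main() {
	var zero, state [16]byte
	for i := range state {
		state[i] = byte(i*7 + 1) // arbitrary test pattern
	}
	fmt.Println(shiftRows(subBytes(state)) == subBytes(shiftRows(state))) // the two steps commute
	fmt.Println(aese(invShiftRows(state), zero) == subBytes(state))       // only the S-box remains
}
```

In this model the only leftover effect of the zero key is the ShiftRows byte permutation, which a caller would have to compensate for if only the S-box lookup is wanted.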
go test -v -short -bench . -run=^$ ./...
goos: linux
goarch: arm64
pkg: github.com/emmansun/gmsm/sm4
BenchmarkEncrypt
BenchmarkEncrypt-2 2145859 559.1 ns/op 28.62 MB/s
BenchmarkDecrypt
BenchmarkDecrypt-2 2145296 559.4 ns/op 28.60 MB/s
BenchmarkExpand
BenchmarkExpand-2 2064466 581.2 ns/op
PASS
ok github.com/emmansun/gmsm/sm4 5.334s
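SM4EKEY SM4E
Go does not yet support the SM4E/SM4EKEY instructions, but we can handle them the way unsupported opcodes are handled, by emitting the raw instruction words: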
- Clone the code from https://github.com/golang/arch
- Modify arm64asm/tables.go: add the SM4E/SM4EKEY constants, add them to opstr, and add the instructions to instFormats.
// SM4E <Vd>.4S, <Vn>.4S
{0xfffffc00, 0xcec08400, SM4E, instArgs{arg_Vd_arrangement_4S, arg_Vn_arrangement_4S}, nil},
// SM4EKEY <Vd>.4S, <Vn>.4S, <Vm>.4S
{0xffe0fc00, 0xce60c800, SM4EKEY, instArgs{arg_Vd_arrangement_4S, arg_Vn_arrangement_4S, arg_Vm_arrangement_4S}, nil},
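- Modify arm64asm/plan9x.go: add SM4E and SM4EKEY to noSuffixOpSet. This step is optional; with it, the plan9x output for these instructions no longer carries the V prefix.
- Write tests. The testDecodeLine() helper is extracted from testDecode() in decode_test.go. Reading the Decode() function is enough to work out how to encode those 32-bit codes.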
func TestDecodeSM4Codes(t *testing.T) {
//gnu syntax, load 16 bytes plaintext to v8 (need to reverse byte order first), 32 round keys to v0-v7, the final result should be reverse byte order again
testDecodeLine(t, "gnu", "0884c0ce| sm4e v8.4s, v0.4s")
testDecodeLine(t, "gnu", "2884c0ce| sm4e v8.4s, v1.4s")
testDecodeLine(t, "gnu", "4884c0ce| sm4e v8.4s, v2.4s")
testDecodeLine(t, "gnu", "6884c0ce| sm4e v8.4s, v3.4s")
testDecodeLine(t, "gnu", "8884c0ce| sm4e v8.4s, v4.4s")
testDecodeLine(t, "gnu", "a884c0ce| sm4e v8.4s, v5.4s")
testDecodeLine(t, "gnu", "c884c0ce| sm4e v8.4s, v6.4s")
testDecodeLine(t, "gnu", "e884c0ce| sm4e v8.4s, v7.4s")
//plan9 syntax, load 16 bytes plaintext to v8 (need to reverse byte order first), 32 round keys to v0-v7, the final result should be reverse byte order again
testDecodeLine(t, "plan9", "0884c0ce| SM4E V0.S4, V8.S4")
testDecodeLine(t, "plan9", "2884c0ce| SM4E V1.S4, V8.S4")
testDecodeLine(t, "plan9", "4884c0ce| SM4E V2.S4, V8.S4")
testDecodeLine(t, "plan9", "6884c0ce| SM4E V3.S4, V8.S4")
testDecodeLine(t, "plan9", "8884c0ce| SM4E V4.S4, V8.S4")
testDecodeLine(t, "plan9", "a884c0ce| SM4E V5.S4, V8.S4")
testDecodeLine(t, "plan9", "c884c0ce| SM4E V6.S4, V8.S4")
testDecodeLine(t, "plan9", "e884c0ce| SM4E V7.S4, V8.S4")
//gnu syntax, load 32 ck to v0-v7, root key (reverse byte order first) xor fk to v8, the result round keys will be in v9, need to move v9 to v8 from second invocation of sm4ekey
testDecodeLine(t, "gnu", "09c960ce| sm4ekey v9.4s, v8.4s, v0.4s")
testDecodeLine(t, "gnu", "09c961ce| sm4ekey v9.4s, v8.4s, v1.4s")
testDecodeLine(t, "gnu", "09c962ce| sm4ekey v9.4s, v8.4s, v2.4s")
testDecodeLine(t, "gnu", "09c963ce| sm4ekey v9.4s, v8.4s, v3.4s")
testDecodeLine(t, "gnu", "09c964ce| sm4ekey v9.4s, v8.4s, v4.4s")
testDecodeLine(t, "gnu", "09c965ce| sm4ekey v9.4s, v8.4s, v5.4s")
testDecodeLine(t, "gnu", "09c966ce| sm4ekey v9.4s, v8.4s, v6.4s")
testDecodeLine(t, "gnu", "09c967ce| sm4ekey v9.4s, v8.4s, v7.4s")
//gnu syntax, load 32 ck to v0-v7, root key (reverse byte order first) xor fk to v8, the result round keys will be in v9 (1,3,5,7) and v8 (2,4,6,8), which avoids register copies.
testDecodeLine(t, "gnu", "09c960ce| sm4ekey v9.4s, v8.4s, v0.4s")
testDecodeLine(t, "gnu", "28c961ce| sm4ekey v8.4s, v9.4s, v1.4s")
testDecodeLine(t, "gnu", "09c962ce| sm4ekey v9.4s, v8.4s, v2.4s")
testDecodeLine(t, "gnu", "28c963ce| sm4ekey v8.4s, v9.4s, v3.4s")
testDecodeLine(t, "gnu", "09c964ce| sm4ekey v9.4s, v8.4s, v4.4s")
testDecodeLine(t, "gnu", "28c965ce| sm4ekey v8.4s, v9.4s, v5.4s")
testDecodeLine(t, "gnu", "09c966ce| sm4ekey v9.4s, v8.4s, v6.4s")
testDecodeLine(t, "gnu", "28c967ce| sm4ekey v8.4s, v9.4s, v7.4s")
}
Each sm4e/sm4ekey invocation performs only 4 rounds, so it has to be invoked 8 times to cover all 32 rounds.
- Then you can use those 32-bit codes in Go's arm64 assembly.
WORD $0x0884c0ce // SM4E V0.S4, V8.S4
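[3/30/2023] Further study and testing in a QEMU environment showed that no byte-order conversion is needed. The form below is the correct one! The SM3 and SM4 NI implementations in the project have passed QEMU tests.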
WORD $0xcec08408 // SM4E V0.S4, V8.S4
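The relationship between the hex strings in the decoder tests and the WORD operand can be sketched as follows. This assumes that the test strings list the instruction bytes in memory order (little-endian) and that WORD takes the 32-bit value itself; `wordFromDecoderHex` is a hypothetical helper name, not part of golang/arch.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// wordFromDecoderHex turns a 4-byte hex string in memory order
// (as written in the decoder tests) into the 32-bit instruction word.
func wordFromDecoderHex(s string) (uint32, error) {
	b, err := hex.DecodeString(s)
	if err != nil {
		return 0, err
	}
	if len(b) != 4 {
		return 0, fmt.Errorf("want 4 bytes, got %d", len(b))
	}
	return binary.LittleEndian.Uint32(b), nil
}

func main() {
	w, err := wordFromDecoderHex("0884c0ce")
	if err != nil {
		panic(err)
	}
	fmt.Printf("WORD $0x%08x // SM4E V0.S4, V8.S4\n", w) // prints WORD $0xcec08408 ...
}
```

Under these assumptions, "0884c0ce" and 0xcec08408 describe the same instruction, which matches the corrected form above.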
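The main drawback of using raw instruction words is poor readability; another is that macros become impossible or awkward to write.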
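One way to soften the readability problem is to generate the WORD lines with a small program instead of writing them by hand. The following is a minimal sketch, assuming the base values from the instFormats entries above (0xcec08400 for SM4E, 0xce60c800 for SM4EKEY) and the usual A64 field positions (Rd in bits 0-4, Rn in bits 5-9, Rm in bits 16-20). The function names `encodeSM4E` and `encodeSM4EKEY` are hypothetical.

```go
package main

import "fmt"

const (
	sm4eBase    = 0xcec08400 // SM4E <Vd>.4S, <Vn>.4S
	sm4ekeyBase = 0xce60c800 // SM4EKEY <Vd>.4S, <Vn>.4S, <Vm>.4S
)

// encodeSM4E returns the instruction word for "sm4e vd.4s, vn.4s".
func encodeSM4E(vd, vn uint32) uint32 {
	return sm4eBase | (vn&31)<<5 | vd&31
}

// encodeSM4EKEY returns the instruction word for "sm4ekey vd.4s, vn.4s, vm.4s".
func encodeSM4EKEY(vd, vn, vm uint32) uint32 {
	return sm4ekeyBase | (vm&31)<<16 | (vn&31)<<5 | vd&31
}

func main() {
	// Eight SM4E invocations: data block in V8, round keys in V0-V7.
	for i := uint32(0); i < 8; i++ {
		fmt.Printf("WORD $0x%08x // SM4E V%d.S4, V8.S4\n", encodeSM4E(8, i), i)
	}
	// Eight SM4EKEY invocations alternating between V9 and V8 as destination.
	for i := uint32(0); i < 8; i++ {
		vd, vn := uint32(9), uint32(8)
		if i%2 == 1 {
			vd, vn = 8, 9
		}
		fmt.Printf("WORD $0x%08x // sm4ekey v%d.4s, v%d.4s, v%d.4s\n",
			encodeSM4EKEY(vd, vn, i), vd, vn, i)
	}
}
```

The first generated line is `WORD $0xcec08408 // SM4E V0.S4, V8.S4`, which agrees with the corrected word shown above.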
P1(X)= X XOR (X <<< 15) XOR (X <<< 23)
P1(X1 XOR X2)
=(X1 XOR X2) XOR ((X1 XOR X2) <<< 15) XOR ((X1 XOR X2) <<< 23)
=X1 XOR X2 XOR (X1 <<< 15) XOR (X2 <<< 15) XOR (X1 <<< 23) XOR (X2 <<< 23)
=X1 XOR (X1 <<< 15) XOR (X1 <<< 23) XOR X2 XOR (X2 <<< 15) XOR (X2 <<< 23)
=P1(X1) XOR P1(X2)
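Here, XOR satisfies:
- commutativity
- associativity
and we assume that (X1 XOR X2) <<< 15 = (X1 <<< 15) XOR (X2 <<< 15), i.e. that rotate-left (ROL) distributes over XOR. This is not obvious at first glance; it holds because a rotation only moves bits to new positions, while XOR acts on each bit position independently.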
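The derivation can be spot-checked numerically. The sketch below uses `bits.RotateLeft32` from the standard library for <<< and a hypothetical helper `p1` for P1; the input values are arbitrary.

```go
package main

import (
	"fmt"
	"math/bits"
)

// p1 is the SM3 permutation P1(X) = X XOR (X <<< 15) XOR (X <<< 23).
func p1(x uint32) uint32 {
	return x ^ bits.RotateLeft32(x, 15) ^ bits.RotateLeft32(x, 23)
}

func main() {
	x1, x2 := uint32(0x01234567), uint32(0x89abcdef)

	// Rotation distributes over XOR.
	fmt.Println(bits.RotateLeft32(x1^x2, 15) ==
		bits.RotateLeft32(x1, 15)^bits.RotateLeft32(x2, 15)) // prints true

	// Hence P1 is linear with respect to XOR.
	fmt.Println(p1(x1^x2) == p1(x1)^p1(x2)) // prints true
}
```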
The last word in SM3PARTW1:
Vd[3] = P1(C XOR (R1 <<< 15)), where C is the XOR of two other words and R1 is one part of X(4i+16): X(4i+16) = R1 XOR R2
tmp.value[0] in SM3PARTW2 is R2
P1(C XOR (R1 <<< 15)) XOR P1(R2 <<< 15) = P1(C XOR (R1 <<< 15) XOR (R2 <<< 15)) = P1(C XOR ((R1 XOR R2) <<< 15))
So the key point is that rotation distributes over XOR or, more generally, that logical shifts distribute over XOR; see "Does a shift operation distribute over XOR".
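The identity that lets the work be split across the two instructions can be checked in the same way. This is only an illustration of the algebra, not a model of the instructions themselves: `splitForm` and `directForm` are hypothetical names, and C, R1 and R2 are arbitrary test values.

```go
package main

import (
	"fmt"
	"math/bits"
)

// p1 is the SM3 permutation P1(X) = X XOR (X <<< 15) XOR (X <<< 23).
func p1(x uint32) uint32 {
	return x ^ bits.RotateLeft32(x, 15) ^ bits.RotateLeft32(x, 23)
}

// splitForm computes the value in two pieces: P1(C XOR (R1 <<< 15)) XOR P1(R2 <<< 15).
func splitForm(c, r1, r2 uint32) uint32 {
	return p1(c^bits.RotateLeft32(r1, 15)) ^ p1(bits.RotateLeft32(r2, 15))
}

// directForm computes P1(C XOR (X <<< 15)) with the complete word X = R1 XOR R2.
func directForm(c, x uint32) uint32 {
	return p1(c ^ bits.RotateLeft32(x, 15))
}

func main() {
	c, r1, r2 := uint32(0x0f1e2d3c), uint32(0x11223344), uint32(0xa5a5a5a5)
	fmt.Println(splitForm(c, r1, r2) == directForm(c, r1^r2)) // prints true

	// Logical shifts distribute over XOR as well.
	fmt.Println((r1^r2)<<7 == r1<<7^r2<<7) // prints true
}
```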
SM3 and SM4 CPU-instruction implementations: no machine with a suitable CPU was available, so this is bookmarked for now.
- Summary of A64 cryptographic instructions
- Arm A64 Instruction Set Architecture
- linux arm64 crypto (https://github.com/torvalds/linux/tree/master/arch/arm64/crypto)
- A Quick Guide to Go's Assembler
- Golang arm instructions mapping
- A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.
- asm2go