forked from Tencent/ncnn
NCNN RVV int8 and fp16sa optimization #2
Open

Xinyu302 wants to merge 67 commits into rv2036:master from Xinyu302:dev
Conversation
- …tn_mask detection (Tencent#5273). Signed-off-by: Molly Sophia <[email protected]>
- Fix SIGBUS error when loading fp16 models on armv7; apply the same fix for bf16
- Bump actions/cache from 3 to 4 (https://github.com/actions/cache). Signed-off-by: dependabot[bot] <[email protected]>
- Signed-off-by: hugo-syn <[email protected]> (two commits)
- pnnx: handle the two-operand add/sub/rsub variants; fuse dynamic slice indexes; add pnnx sliceindexes; reset device may change non-dtype input numeric 5 to 6; print inf as float; preserve dtype for generation ops; convert torch.masked_select and test it; test negative slice
- Signed-off-by: Xilin Wu <[email protected]>
- Replace scalar stores with vsseg2e32_v_f32m1; refine transpose (co-authored by Xinyu302)
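The vsseg2e32_v_f32m1 store mentioned in that commit writes two vector registers to memory interleaved (a "segment" store of two 32-bit fields). A scalar sketch of that semantics, with an illustrative helper name rather than the actual RVV intrinsic:

```cpp
#include <cstddef>

// Scalar model of RVV vsseg2e32: store two float "fields" interleaved,
// dst[2*i] = a[i], dst[2*i + 1] = b[i]. Helper name is illustrative only;
// the real code uses the vsseg2e32_v_f32m1 intrinsic on vector registers.
void sseg2_store_f32(const float* a, const float* b, float* dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dst[2 * i] = a[i];
        dst[2 * i + 1] = b[i];
    }
}
```

Replacing pairs of scalar stores with one segment store lets the hardware do the interleaving, which is why it helps the transpose path.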
- Signed-off-by: Molly Sophia <[email protected]>
- Promote vfpv4 for automatic fp16 storage conversion; always report NEON and vfpv4 on arm64
- Add basic ShuffleChannel, then fix its bugs (co-authored by Xinyu302)
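For reference, ShuffleChannel views the channel axis as (groups, channels_per_group) and transposes it, so data can mix across grouped convolutions. A plain scalar sketch of the operation (function name and layout are illustrative, not ncnn's API):

```cpp
#include <cstddef>
#include <vector>

// Channel shuffle: channel c = g * cpg + k moves to k * groups + g,
// i.e. view channels as [groups][cpg] and transpose to [cpg][groups].
// Data layout assumed here: channel-major, hw contiguous values per channel.
std::vector<float> shuffle_channel(const std::vector<float>& src,
                                   size_t channels, size_t groups, size_t hw)
{
    const size_t cpg = channels / groups; // channels per group
    std::vector<float> dst(src.size());
    for (size_t g = 0; g < groups; g++)
        for (size_t k = 0; k < cpg; k++)
        {
            const float* s = &src[(g * cpg + k) * hw];
            float* d = &dst[(k * groups + g) * hw];
            for (size_t i = 0; i < hw; i++)
                d[i] = s[i];
        }
    return dst;
}
```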
- conv 3x3 packn asm kernel (co-authored by fty1777)
- int8 quantization work (co-authored by Xinyu302 and fty1777):
  - add float2int8 and float2int8leakyrelu; fix a float2int8relu bug
  - add quantize_riscv and fix its bugs; quantize fp16 to int8 (#6)
  - add dequantize_riscv (#3): fix a vset bug, refine, and add an fp16sa path
  - add int8 innerproduct (#5): broken at first because fp16 quantize was missing; fixing flatten made the test pass
  - copy the arm convolutiondepthwise implementation to convolutiondepthwise_riscv.cpp
  - change int8 packn to 8 (#7) and fix test_convert packing; modify headers (#8)
  - finish convolutiondepthwise_3x3_pack8_int8
  - fix packing in quantize and innerproduct (#10)
  - switch padding to pack8 (#12)
  - add requantize (#13) and fix its bugs
  - port the arm int8 kernels to the RISC-V V extension, including a vpadalq_s16 equivalent; pass tests
  - fix net.cpp layer packing and a segfault; add fp16 dequantize_riscv.cpp
  - fix remaining int8 packn and depthwise conv bugs
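The float2int8 step above is the usual round-and-saturate mapping applied before int8 kernels: scale, round to nearest, clamp so |q| ≤ 127. A scalar sketch of that conversion (the RVV kernels vectorize the same arithmetic; the exact rounding mode in this PR's code may differ):

```cpp
#include <algorithm>
#include <cmath>

// Quantize one float to int8: scale, round to nearest, saturate to [-127, 127].
// -128 is deliberately kept out of range so the magnitude stays symmetric.
signed char float2int8(float v, float scale)
{
    int q = static_cast<int>(std::round(v * scale));
    q = std::max(-127, std::min(127, q));
    return static_cast<signed char>(q);
}
```

Dequantize is the inverse (multiply by 1/scale); requantize fuses dequantize, any activation, and a fresh quantize so intermediate data can stay in int8.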
- reorder instructions; conv 3x3 pack1 kernel (co-authored by fty1777)
- int8 Winograd convolution (co-authored by fty1777 and Xinyu302):
  - switch f16 to i16; create top_blob_tm as 4u * packn
  - write the winograd23 input/output/kernel transforms for int8
  - fix a bug in convolution_winograd_dot_packn_int8.h
  - divide the winograd23 result by 2 so the tests pass
  - optimize winograd23 for RVV int8; add winograd43 RVV int8
- conv 3x3 pack1ton kernel (co-authored by fty1777)
Dear contestant, hello.
Confirmed correct.
Thank you for your reply.
Reproduction environment

Test results

The optimization adds Winograd support for convolution, but because Winograd enlarges the model weights, some models perform worse with it; the benchncnn source was therefore modified to control, per model, whether the Winograd optimization is used.
Besides the models officially stated to run within 55 MB, this optimization also lets the squeezenet_ssd_int8 model run correctly.
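Toggling Winograd per run can be done through ncnn's Option before loading the model. A sketch of the kind of change made to benchncnn, assuming the stock ncnn::Option fields; how the actual patch wires the per-model choice is not shown in this PR excerpt:

```cpp
// Config sketch only: disable Winograd for models whose transformed
// weights would blow up the memory budget. Both fields below are stock
// ncnn::Option members; the per-model selection logic is an assumption.
ncnn::Option opt;
opt.use_winograd_convolution = false; // fall back to direct/im2col kernels
opt.use_fp16_arithmetic = true;       // fp16sa path, as targeted by this PR

ncnn::Net net;
net.opt = opt;
```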
Run command:

Model run times: