NCNN RVV int8 and fp16sa optimization #2

Open
wants to merge 67 commits into
base: master
from

Conversation

Xinyu302

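Reproduction environment

  • Official buildroot image, 55 MB of RAM plus 256 MB of swap. (Without swap, some models run close to the limit of the 55 MB of RAM and perform worse than with swap enabled.)
  • Build ncnn: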
$ cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/c906-v240-musl.toolchain.cmake -DCMAKE_BUILD_TYPE=release -DNCNN_BENCHMARK=OFF -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON -DNCNN_BUILD_TESTS=OFF ..
$ make -j 8

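Test results

The optimizations include a Winograd path for Conv. Because Winograd makes the model weights larger, some models perform poorly with it, so the benchncnn source was modified to control whether a model uses the Winograd optimization.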
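As a rough illustration of why Winograd enlarges the weights, the sketch below applies the standard F(2x2, 3x3) kernel transform U = G g Gᵀ to one 3x3 kernel, which produces a 4x4 tile. It is plain float C++ for readability, not the int8 RVV kernels from this PR, and the names `winograd23_transform_kernel` and `transformed_weight_ratio` are hypothetical. It assumes the transformed kernels are stored in place of the originals.

```cpp
#include <array>

using Kernel3x3 = std::array<std::array<float, 3>, 3>;
using Tile4x4 = std::array<std::array<float, 4>, 4>;

// Winograd F(2x2, 3x3) kernel transform: U = G * g * G^T.
// One 3x3 kernel (9 weights) becomes one 4x4 tile (16 weights).
Tile4x4 winograd23_transform_kernel(const Kernel3x3& g)
{
    static const float G[4][3] = {
        {1.0f, 0.0f, 0.0f},
        {0.5f, 0.5f, 0.5f},
        {0.5f, -0.5f, 0.5f},
        {0.0f, 0.0f, 1.0f},
    };

    // tmp = G * g, a 4x3 intermediate
    float tmp[4][3] = {};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                tmp[i][j] += G[i][k] * g[k][j];

    // U = tmp * G^T, the 4x4 transformed kernel
    Tile4x4 U{};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 3; k++)
                U[i][j] += tmp[i][k] * G[j][k];

    return U;
}

// Weight growth for F(m x m, r x r): the tile side is m + r - 1,
// so each r x r kernel grows by (m + r - 1)^2 / r^2.
double transformed_weight_ratio(int m, int r)
{
    const double tile = m + r - 1;
    return (tile * tile) / (static_cast<double>(r) * r);
}
```

Under these assumptions, `transformed_weight_ratio(2, 3)` is 16/9 (about 1.78x) and `transformed_weight_ratio(4, 3)` is 36/9 (4x) per 3x3 kernel, which is a plausible reason for the memory pressure on a 55 MB device.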

In addition to the models that the official notes list as runnable in 55 MB, this optimization also lets the squeezenet_ssd_int8 model run normally.

Run command:

./benchncnn 4 1 0 -1 0

Model run times (benchncnn output, in milliseconds):

          squeezenet  min =  224.48  max =  226.46  avg =  225.52
     squeezenet_int8  min =  540.41  max =  551.01  avg =  545.77
           mobilenet  min =  336.35  max =  338.45  avg =  337.75
      mobilenet_int8  min =  888.55  max =  889.63  avg =  889.00
        mobilenet_v2  min =  235.79  max =  237.13  avg =  236.29
        mobilenet_v3  min =  208.19  max =  209.76  avg =  208.71
          shufflenet  min =  166.60  max =  167.01  avg =  166.78
       shufflenet_v2  min =  156.89  max =  159.10  avg =  158.19
             mnasnet  min =  256.52  max =  257.56  avg =  257.06
     proxylessnasnet  min =  299.75  max =  300.23  avg =  299.93
     efficientnet_b0  min =  365.72  max =  366.99  avg =  366.16
   efficientnetv2_b0  min =  501.54  max =  504.23  avg =  502.88
        regnety_400m  min =  344.23  max =  346.39  avg =  345.04
           blazeface  min =  119.88  max =  120.41  avg =  120.07
      googlenet_int8  min = 2746.41  max = 2748.56  avg = 2747.50
       resnet18_int8  min = 2782.48  max = 2792.17  avg = 2788.06
 squeezenet_ssd_int8  min = 3491.84  max = 3586.81  avg = 3524.24
       mobilenet_ssd  min =  775.98  max =  785.64  avg =  778.90
  mobilenet_ssd_int8  min = 1947.19  max = 1978.77  avg = 1956.45
  mobilenetv2_yolov3  min =  821.81  max =  826.52  avg =  824.13
           nanodet_m  min =  395.29  max =  396.09  avg =  395.72
    yolo-fastest-1.1  min =  185.70  max =  186.40  avg =  186.13
      yolo-fastestv2  min =  159.52  max =  160.26  avg =  159.77
          FastestDet  min =  165.56  max =  168.29  avg =  166.85

nihui and others added 30 commits January 10, 2024 13:51
* fix sigbus error when loading fp16 model on armv7

* apply for bf16
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](actions/cache@v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* pnnx handle two operands add/sub/rsub variant

* fuse dynamic slice indexes, wip

* pnnx sliceindexes

* reset device may change non-dtype input numeric 5 to 6

* print inf as float

* preserve dtype for generation op

* pnnx convert torch.masked_select

* test masked_select

* test negative slice
* replace storexxx to vsseg2e32_v_f32m1

* refine transpose

---------

Co-authored-by: Xinyu302 <[email protected]>
* promote vfpv4 for auto fp16 storage conversion

* always report neon and vfpv4 for arm64
Xinyu302 and others added 26 commits February 23, 2024 13:07
* add basic shufflechannel

* finish but bug

* fix bug

* apply code-format changes

---------

Co-authored-by: Xinyu302 <[email protected]>
* convd 3x3 packn asm

* apply code-format changes

---------

Co-authored-by: fty1777 <[email protected]>
* add float2int8

* pass compile

* apply code-format changes

* add float2int8leakyrelu

* add quantize_riscv but has bug

* fix bug

* fix bug of float2int8relu

* apply code-format changes

* Add quantize fp16 (#6)

* finish quantize fp16 to int8

* fix quantize bug

* apply code-format changes

---------

Co-authored-by: Xinyu302 <[email protected]>

* Add dequantize (#3)

* add-deq

* apply code-format changes

* ongoing dequantize

* apply code-format changes

* finish dequantize_riscv, but has bug

* fix vset bug

* fix bug

* delete debug info

* apply code-format changes

* refine dequantize

---------

Co-authored-by: Xinyu302 <[email protected]>

* Add innerproduct (#5)

* add int innerproduct

* apply code-format changes

* finish innerproduct, but not tested yet

* apply code-format changes

* bug because no fp16 quantize

* change flatten to make it right

* pass test

* delete useless code

* apply code-format changes

* delete useless code

---------

Co-authored-by: Xinyu302 <[email protected]>

* copy arm convolutiondepthwise to convolutiondepthwise_riscv.cpp

* change int8 packn to 8 (#7)

* change int8 packn to 8

* test_convert packing right

* modify header (#8)

* finish convolutiondepthwise_3x3_pack8_int8

* apply code-format changes

* delete comment

* Fix pack (#10)

* modify

* debug

* fix quantize and innerproduct

* apply code-format changes

---------

Co-authored-by: Xinyu302 <[email protected]>

* pack8 maybe right

* apply code-format changes

* debug

* debug

* modify padding using pack8 (#12)

* apply code-format changes

* use pack8

* apply code-format changes

* pack8 right

* add basic conv

* apply code-format changes

* finish requantize but has bug (#13)

* finish requantize but has bug

* fix bug

* delete comment

* apply code-format changes

---------

Co-authored-by: Xinyu302 <[email protected]>

* now can use requantize

* delete comment

* add arm base, now to rewrite it to riscv-v extension

* apply code-format changes

* try to finish

* apply code-format changes

* try to add pack8

* try to handle vpadalq_s16

* apply code-format changes

* finish kernel. pass test

* use new kernel

* fix kernel bug

* pass test

* apply code-format changes

* fix net.cpp layer pack

* fix segfault bug

* add fp16 dequantize_riscv.cpp

* use same elesize

* remove comment

* delete comment

* apply code-format changes

* dequantize fp16sa

* apply code-format changes

* WIP: buggy int8 packn

* WIP: maybe fixed

* apply code-format changes

* fix depthwise conv bug

---------

Co-authored-by: Xinyu302 <[email protected]>
Co-authored-by: fty1777 <[email protected]>
* reorder inst

* convd 3x3 pack1

* apply code-format changes

---------

Co-authored-by: fty1777 <[email protected]>
* WIP: conv wino int8

* change f16 to i16

* top_blob_tm create 4u * packn

* WIP: conv 3x3 winograd transform input(1/2)/output(0/2)/kernel(2/2done)

* finish winograd23 int8 transform

* WIP: conv 3x3 winograd transform input(1/2)/output(1/2)/kernel(2/2done)

* apply code-format changes

* fix bug in convolution_winograd_dot_packn_int8.h

* can compile, not test yet

* use winograd transform kernel

* winograd23 result divide 2, now can pass test

* apply code-format changes

* winograd23 riscv int8 opt

* conv winograd43 riscv int8

---------

Co-authored-by: fty1777 <[email protected]>
Co-authored-by: fty1777 <[email protected]>
Co-authored-by: Xinyu302 <[email protected]>
* conv 3x3 pack1ton

* apply code-format changes

---------

Co-authored-by: fty1777 <[email protected]>
Co-authored-by: fty1777 <[email protected]>
@shiptux

shiptux commented Mar 26, 2024

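Dear contestant,
The preliminary re-test results for the PR you submitted to this championship are shown at https://github.com/plctlab/rvspoc/blob/main/Results/Verifications/S2310/README.md. If you have any objections, please reply to this comment. If everything is correct, please reply "确认无误" ("Confirmed"). Thank you for your cooperation.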

@Xinyu302
Author

Confirmed

@shiptux

shiptux commented Mar 26, 2024

Confirmed

Thank you for your reply.
