Segmentation fault during on-device training with Vulkan #3129

Open
RabbitDF opened this issue Dec 19, 2024 · 5 comments
Labels
question Further information is requested

Comments

@RabbitDF

Platform:

  • Build platform: Linux
  • Target platform: Android / Samsung Galaxy S23 / Adreno 740

GitHub version:

  • Download date: 2024.12.13
  • Commit id: 47027e10992473eb4da4c536542fbefd6e5fcc1c

Build steps:

export ANDROID_NDK=/home/zhangfan/thirdparty/android-ndk-r27c
cd /home/zhangfan/thirdparty/MNN/
./schema/generate.sh
mkdir build
cd build
cmake .. -DMNN_BUILD_CONVERTER=true && make -j4
./tools/script/get_model.sh
cd project/android
mkdir build_64 && cd build_64 && ../build_64.sh -DMNN_BUILD_TRAIN=ON -DMNN_VULKAN=true -DMNN_OPENCL=true

Build log:

-- MNN BUILD INFO:
--      System: Android
--      Processor: aarch64
--      Version: 3.0.1
--      Metal: OFF
--      OpenCL: true
--      OpenGL: OFF
--      Vulkan: true
--      ARM82: ON
--      KleidiAI: OFF
--      oneDNN: OFF
--      TensorRT: OFF
--      CoreML: OFF
--      NNAPI: OFF
--      CUDA: OFF
--      OpenMP: OFF
--      BF16: OFF
--      ThreadPool: ON
--      Hidden: TRUE
...
[100%] Built target run_test.out

Running the demo on the device:

./adb shell "cd /data/local/tmp/build_64&&export LD_LIBRARY_PATH=.:./tools/train/:$LD_LIBRARY_PATH&&./runTrainDemo.out MnistTrain ./mnist/unzipped/"

Experiment:

Changed the backend passed to exe->setGlobalExecutorConfig(MNN_FORWARD_VULKAN, config, 4); at "MNN/tools/train/source/demo/MnistUtils.cpp":40

  • MNN_FORWARD_CPU: runs correctly
  • MNN_FORWARD_OPENCL: runs correctly
  • MNN_FORWARD_VULKAN: segmentation fault

Vulkan training output:

Can't open file:/sys/devices/system/cpu/cpufreq/boost/affected_cpus
CPU Group: [ 0  1  2 ], 307200 - 2016000
CPU Group: [ 3  4  5  6 ], 499200 - 2803200
CPU Group: [ 7 ], 595200 - 3360000
The device supports: i8sdot:1, fp16:1, i8mm: 1, sve2: 0
Vulkan don't support 9, Cast:
Vulkan don't support for , type=Convolution, Special case
Vulkan don't support for , type=Convolution, Special case
Vulkan don't support for , type=Raster, Special case
Vulkan don't support 9, Cast:
Vulkan don't support 119, OneHot:
epoch: 0  64 / 60000 loss: 2.30469 lr: 0.01 time: 100.997 ms / 0 iter
Segmentation fault

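Analysis:

We traced the crash to the backward pass that follows the first forward computation: MNN\tools\train\source\demo\MnistUtils.cpp:124, sgd->step(loss);
One level deeper, it occurs at MNN\source\core\Pipeline.cpp:1053, auto code = iter.execution->onResize(iter.workInputs, iter.workOutputs);

Does MNN support on-device training with Vulkan? Is there a working demo? Looking forward to your reply, thanks!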

@jxt1234
Collaborator

jxt1234 commented Dec 19, 2024

When training with Vulkan, add -DMNN_VULKAN_IMAGE=false at build time. Currently only the Vulkan buffer path supports the operators needed for the backward pass.
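
A minimal sketch of the rebuild this suggests, assuming the same directory layout and flags as the original build steps, with only -DMNN_VULKAN_IMAGE=false added. The names `MNN_TRAIN_FLAGS` and `rebuild_mnn_vulkan_buffer` are hypothetical and exist only for illustration; the function is defined but not invoked, so nothing is built until it is called from MNN/project/android.

```shell
#!/bin/sh
# Flags from the original report plus the flag suggested by the maintainer.
# MNN_TRAIN_FLAGS is a hypothetical variable name used only in this sketch.
MNN_TRAIN_FLAGS="-DMNN_BUILD_TRAIN=ON -DMNN_VULKAN=true -DMNN_OPENCL=true -DMNN_VULKAN_IMAGE=false"

# Hypothetical helper; assumes the current directory is MNN/project/android
# and that ANDROID_NDK is exported as in the original build steps.
rebuild_mnn_vulkan_buffer() {
    # $MNN_TRAIN_FLAGS is intentionally unquoted so each flag becomes its own argument.
    mkdir -p build_64 && cd build_64 && ../build_64.sh $MNN_TRAIN_FLAGS
}

# Print the flags for inspection; call rebuild_mnn_vulkan_buffer to actually build.
echo "$MNN_TRAIN_FLAGS"
```

After rebuilding, the libraries and runTrainDemo.out on the device need to be replaced before rerunning the MnistTrain demo.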

@jxt1234 jxt1234 added the question Further information is requested label Dec 19, 2024
@RabbitDF
Author

RabbitDF commented Dec 19, 2024

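> When training with Vulkan, add -DMNN_VULKAN_IMAGE=false at build time. Currently only the Vulkan buffer path supports the operators needed for the backward pass.

Training works with the buffer path now! Thank you!

One more question: why do OpenCL and Vulkan both appear to train more slowly than the CPU here? For example, Vulkan has no implementation of the NN::Linear operator; do operators like that end up running on the CPU?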

  • cpu: 183.711 ms / 10 iter
  • opencl: 8697.82 ms / 10 iter
  • vulkan: 3411.07 ms / 10 iter
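
To put these timings on a common scale, the sketch below converts each figure to milliseconds per iteration and to a slowdown factor relative to the CPU run. It uses only the three numbers reported above; the variable names are illustrative.

```shell
#!/bin/sh
# Reported timings, in ms per 10 iterations.
cpu=183.711; opencl=8697.82; vulkan=3411.07

# Compute per-iteration time and slowdown relative to the CPU run.
report=$(awk -v c="$cpu" -v o="$opencl" -v v="$vulkan" 'BEGIN {
    printf "cpu    %.1f ms/iter  1.0x\n", c / 10
    printf "opencl %.1f ms/iter  %.1fx\n", o / 10, o / c
    printf "vulkan %.1f ms/iter  %.1fx\n", v / 10, v / c
}')
echo "$report"
```

On these numbers OpenCL is about 47 times slower than the CPU and Vulkan about 19 times slower.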

@jxt1234
Collaborator

jxt1234 commented Dec 20, 2024

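It looks like some operators may be falling back, although NN::Linear should be supported. In addition, resize is expensive on the GPU, so models with little computation perform poorly when trained with a dynamic graph.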

@RabbitDF
Author

RabbitDF commented Dec 20, 2024

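> It looks like some operators may be falling back, although NN::Linear should be supported. In addition, resize is expensive on the GPU, so models with little computation perform poorly when trained with a dynamic graph.

OK. I see that the Linear operator is split into MatMul and Add. Vulkan does implement MatMul, but Add lives in express/MathOp.cpp.
If the Vulkan backend is specified, will the Add operator run on the GPU together with MatMul?

Thank you very much for your help, and have a great day 😄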

@jxt1234
Collaborator

jxt1234 commented Dec 23, 2024

Both should run on Vulkan. The MnistTrain demo was debugged previously, and all of its operators are supported on the Vulkan buffer path.
