Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add oneDNN #937

Closed
wants to merge 23 commits into from
Closed

Add oneDNN #937

wants to merge 23 commits into from

Conversation

graemenail
Copy link
Member

@graemenail graemenail commented Jun 3, 2022

TODO

  • Manual check of training / scoring

Description

Replaces MKL with oneDNN. This PR should enable completely open-source releases of Marian.

This PR also includes a caching of boost, and a cleaning of the debug build directory. These can be broken off into a separate PR if necessary. During testing, windows builds would fail from running out of space building both debug and release. Since disabling oneDNN JIT profiling the build sizes are smaller.

Related: marian-nmt/marian-regression-tests#86
Closes: #706

List of changes:

  • Adds oneDNN as a submodule
  • Removes MKL related code
  • Use oneDNN for sgemm
  • Modified code paths:
    • batched GEMM (cblas_sgemm_batch in prod.cpp)
    • TransposeFirst3In4 (mkl_somatcopy in tensor_operators.cpp)
    • Bias copy-op (mkl_somatcopy in packed_gemm.cpp)

The following have some MKL component, and have not been touched:

  • azure workflows
  • src/microsoft code
  • vs/*

Added dependencies: Intel oneDNN (removing Intel MKL)

How to test

Ran the 1 million sentence testset of WNGT21 through the MKL and oneDNN versions.

Using the wngt19 model from regression tests.

Setup
relative-paths: true
models:
  - wngt19/model.base.npz
vocabs:
  - wngt19/en-de.spm
  - wngt19/en-de.spm
normalize: 0.6
beam-size: 1
mini-batch: 128
maxi-batch: 100
maxi-batch-sort: src
cpu-threads: 16

Ran #cpu tagged regression tests. Non-intgemm tests pass.

Regression test results
Skipped:
- tests/models/wngt19/test_model_base_fbgemm_packed16.sh
- tests/models/wngt19/test_model_base_fbgemm_packed8.sh
- tests/server/test_ende_cpu.sh
Failed:
- tests/decoder/intgemm/test_intgemm_16bit.sh
- tests/decoder/intgemm/test_intgemm_16bit_avx2.sh
- tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
- tests/decoder/intgemm/test_intgemm_8bit.sh
- tests/decoder/intgemm/test_intgemm_8bit_avx2.sh
- tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
Logs:
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_16bit.sh.log
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_16bit_avx2.sh.log
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_16bit_sse2.sh.log
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_8bit.sh.log
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_8bit_avx2.sh.log
- /home/gnail/dev/marian-dev/regression-tests/tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh.log
---------------------
Ran 19 tests in 00:03:6.267s, 10 passed, 3 skipped, 6 failed
FAILED

The static binaries are now larger (merge #938 first for comparison).

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@snukky
Copy link
Member

snukky commented Sep 12, 2022

I merged this with the recent master (https://github.com/marian-nmt/marian-dev/tree/graemenail-onednn) and tested with the regression tests that we have. Intgemm tests and one wngt19 8bit test fail, but I guess this is expected and in fact can be also because of #933. In our tests, except tests with quantized 8bit models tested on CPU, it also throws the abort here. Other tests pass.

As for other comments, as discussed earlier, instead of replacing MKL with oneDNN, we would like to have this backward compatible at least for some time and to keep the previous behavior as default. So oneDNN would be enabled via CMake flag and MKL is used by default if detected.

@graemenail
Copy link
Member Author

Thanks for the comments @snukky. On the LSH abort, it's a while since I looked at this, I don't believe oneDNN provides all the BLAS functions required. From a quick check, I'd guess this #706 (comment) is still the status of non-GEMM BLAS in oneDNN.

I had started on the oneDNN + MKL implementation, I'll dig it out.

@graemenail graemenail mentioned this pull request Sep 14, 2022
4 tasks
@graemenail
Copy link
Member Author

Closing in favour of #967

@graemenail graemenail closed this Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider oneDNN instead of MKL for SGEMM
2 participants