This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Issues with Docker #3

Open
cjlovering opened this issue Oct 22, 2019 · 10 comments

Comments

@cjlovering

After installing Docker (on macOS), the build failed. I am on the latest commit in master.

I get the following message:

Traceback (most recent call last):
  File "setup.py", line 759, in <module>
    build_deps()
  File "setup.py", line 311, in build_deps
    cmake=cmake)
  File "/src/pytorch/tools/build_pytorch_libs.py", line 59, in build_caffe2
    cmake.build(my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 334, in build
    self.run(build_args, my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 142, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/miniconda3/envs/torchbeast/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.
@heiner
Contributor

heiner commented Oct 23, 2019

Hey Charles, thanks for reporting this!

I just pushed e233fbc which should resolve this issue. Please let us know if you run into further problems.

BTW please note that we had limited success building the Docker image on MacOS as it seems to stall while compiling PyTorch. This may be a resource constraint.

@cjlovering
Author

cjlovering commented Oct 23, 2019

Hello Heinrich,

Thanks for the help. Unfortunately, this did not fix the issue for me; a similar error occurred in the end. (I have included a few more messages from the build below.) I will try building on a Linux machine and see if I can get it to work there.

Best,
Charles

[...]
[1363/2619] Building CXX object caffe2/CMakeFiles/net_async_tracing_test.dir/core/net_async_tracing_test.cc.o
[1364/2619] Building CXX object caffe2/CMakeFiles/kernel_stackbased_test.dir/__/aten/src/ATen/core/op_registration/kernel_stackbased_test.cpp.o
[1365/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_dlpack.cc.o
[1366/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_registry.cc.o
[1367/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o
FAILED: caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o 
/usr/bin/c++  -DAT_PARALLEL_OPENMP=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -D_FILE_OFFSET_BITS=64 -D_THP_CORE -Dcaffe2_pybind11_state_EXPORTS -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -Icaffe2/aten/src/TH -I../aten/src/TH -Icaffe2/aten/src -Iaten/src -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -I../third_party/miniz-2.0.8 -I../caffe2/core/nomnigraph/include -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../c10/.. -Ithird_party/ideep/mkl-dnn/include -I../third_party/ideep/mkl-dnn/src/../include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /root/miniconda3/envs/torchbeast/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /root/miniconda3/envs/torchbeast/include/python3.7m -isystem /root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /opt/rocm/hip/include -isystem /include -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -fPIC   -fvisibility=hidden -DCAFFE2_USE_GLOO -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -fopenmp -std=gnu++11 -MD -MT caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -c ../caffe2/python/pybind_state.cc
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[1368/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_int8.cc.o
[1369/2619] Building CXX object caffe2/CMakeFiles/kernel_functor_test.dir/__/aten/src/ATen/core/op_registration/kernel_functor_test.cpp.o
[1370/2619] Building CXX object caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state_nomni.cc.o
ninja: build stopped: subcommand failed.
Building wheel torch-1.2.0a0+54a63e0
-- Building version 1.2.0a0+54a63e0
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/src/pytorch/torch -DCMAKE_PREFIX_PATH=/root/miniconda3/envs/torchbeast -DNUMPY_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/lib/python3.7/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/miniconda3/envs/torchbeast/bin/python -DPYTHON_INCLUDE_DIR=/root/miniconda3/envs/torchbeast/include/python3.7m -DPYTHON_LIBRARY=/root/miniconda3/envs/torchbeast/lib/libpython3.7m.so.1.0 -DTORCH_BUILD_VERSION=1.2.0a0+54a63e0 -DUSE_CUDA=False -DUSE_DISTRIBUTED=True -DUSE_NUMPY=True -DUSE_SYSTEM_EIGEN_INSTALL=OFF /src/pytorch
cmake --build . --target install --config Release -- -j 4
Traceback (most recent call last):
  File "setup.py", line 756, in <module>
    build_deps()
  File "setup.py", line 325, in build_deps
    cmake=cmake)
  File "/src/pytorch/tools/build_pytorch_libs.py", line 64, in build_caffe2
    cmake.build(my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 321, in build
    self.run(build_args, my_env)
  File "/src/pytorch/tools/setup_helpers/cmake.py", line 133, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/miniconda3/envs/torchbeast/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.
The command '/bin/bash -c python setup.py install' returned a non-zero code: 1

@edran

edran commented Nov 11, 2019

Hey @cjlovering, did you end up making progress on this?

@cjlovering
Author

cjlovering commented Nov 11, 2019

> Hey @cjlovering, did you end up making progress on this?

Hello @edran, not so far. I tried again with fresh installs and updated source on macOS and it did not work. I have not had a chance to try on a Linux machine (as I don't have immediate access to one with sufficient permissions). I will be coming back to this issue again soon.

@bottler

bottler commented Nov 11, 2019

Docker Desktop on Mac has its own set of resource constraints (controlled in Preferences -> Advanced). I wonder if the internal compiler error is caused by a failure to allocate memory under that constraint.
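
A "Killed (program cc1plus)" message is consistent with the compiler process being terminated when the container runs out of memory, which fits the hypothesis above. One way to test it is to lower the number of parallel compile jobs. The sketch below reads the memory visible inside the container and suggests a job count, which could then be passed to the PyTorch build through the MAX_JOBS environment variable. The 2 GiB-per-job figure and both function names are assumptions for illustration, not measured values or part of torchbeast.

```python
import os


def read_mem_total_gib(meminfo_path="/proc/meminfo"):
    """Return MemTotal in GiB as seen inside the container, or None if unreadable."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    kib = int(line.split()[1])  # /proc/meminfo reports this value in kB
                    return kib / (1024 * 1024)
    except OSError:
        pass
    return None


def suggest_max_jobs(mem_gib, gib_per_job=2.0, cpu_count=None):
    """Pick a parallel job count that should fit in memory.

    gib_per_job is a rough guess (an assumption), not a measured figure
    for the PyTorch build.
    """
    cpus = cpu_count or os.cpu_count() or 1
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpus, by_memory))


if __name__ == "__main__":
    mem = read_mem_total_gib()
    if mem is None:
        print("Could not read /proc/meminfo; consider MAX_JOBS=1")
    else:
        print(f"MemTotal: {mem:.1f} GiB, suggested MAX_JOBS={suggest_max_jobs(mem)}")
```

If the suggested value is lower than the `-j 4` seen in the log, setting MAX_JOBS to that value before `python setup.py install` (or raising the memory limit in Docker Desktop) would be a reasonable next experiment.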

@cjlovering
Author

Hello! I've stopped trying to get this to work on Mac, and am now trying on Google Cloud.

I have gotten Docker and polybeast to run, but I have not been able to use GPUs with it. Do you have a recommended approach for using Docker with GPUs?

@edran

edran commented Nov 26, 2019

@cjlovering you most likely want to use nvidia-docker, and modify our image to:

  1. either have cuda installed before pytorch;
  2. or simply replace https://github.com/facebookresearch/torchbeast/blob/master/Dockerfile#L2 with an image from NVIDIA's hub: https://hub.docker.com/r/nvidia/cuda/ (the 18.04 cudnn one should work in theory).
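
For the nvidia-docker part, a minimal sketch of how the container could be launched with GPU access is shown below. It only builds the `docker run` argument list. The `--gpus all` flag assumes Docker 19.03 or later with the NVIDIA container toolkit installed on the host, while `--runtime=nvidia` is the older nvidia-docker2 style. The function name and the `torchbeast` image tag are hypothetical.

```python
import shlex


def gpu_run_command(image, command, legacy_runtime=False):
    """Build a `docker run` argv that exposes the host's NVIDIA GPUs.

    legacy_runtime=True uses the nvidia-docker2 style `--runtime=nvidia`;
    otherwise the `--gpus all` flag is used.
    """
    gpu_flag = ["--runtime=nvidia"] if legacy_runtime else ["--gpus", "all"]
    return ["docker", "run", "--rm", *gpu_flag, image, *shlex.split(command)]


if __name__ == "__main__":
    # "torchbeast" is a hypothetical local tag for the rebuilt image.
    print(" ".join(gpu_run_command("torchbeast", "nvidia-smi")))
```

The returned list can be passed directly to `subprocess.run` on a host where Docker and the NVIDIA runtime are available.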

@cjlovering
Author

cjlovering commented Nov 26, 2019

@edran Thank you! (I followed the second option and used nvidia/cuda:10.1-base-ubuntu18.04).

I think the GPUs are available and CUDA is installed. For instance, I was able to run nvidia-smi and see the GPU status (by adding another CMD to the Dockerfile). However, when the polybeast script is run, it does not find CUDA available.

Is there something I should update in the pytorch installation or something along those lines?

@edran

edran commented Nov 26, 2019

I don't know whether the base image is enough to pull both cuda and cudnn, and it's possible that both might be required for pytorch to be compiled with cuda support. Try using 10.1-cudnn7-runtime-ubuntu18.04.

Also, if you share your dockerfile I can give it a go locally to see whether I spot issues.
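
To narrow down whether the problem is in the PyTorch build or in the container runtime, a small diagnostic like the sketch below can be run inside the container. It reports whether the installed PyTorch was compiled with CUDA at all (`torch.version.cuda` is `None` for a CPU-only build) separately from whether a GPU is visible at run time. The function name `cuda_report` is made up for this example. One possible explanation, not confirmed in this thread, is that the `base` and `runtime` image variants do not ship the CUDA compiler toolchain, so a from-source build falls back to CPU-only; the earlier build log also shows `-DUSE_CUDA=False` being passed to cmake.

```python
def cuda_report():
    """Collect facts about the installed PyTorch build's CUDA support."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means this PyTorch binary was built without CUDA support.
        "built_with_cuda": torch.version.cuda,
        "cudnn_available": torch.backends.cudnn.is_available(),
        # False can mean a CPU-only build or that no GPU/driver is visible.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` while nvidia-smi works, the runtime is probably fine and the PyTorch build step is the part to revisit, for example by trying a `devel` image variant.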

@cjlovering
Author

cjlovering commented Nov 26, 2019

Thanks! I tried updating the image with that and it didn't seem to work for me.

Here's the file (with the updated image):
https://github.com/cjlovering/torchbeast/blob/master/Dockerfile
