Description
Environment:
- Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
- Framework version: 1.12.0.dev
- Horovod version: 0.24.2
- MPI version: 3.1
- ROCm version: 5.0.1 and 5.1.1
- NCCL version:
- Python version: 3.8
- Spark / PySpark version:
- Ray version:
- OS and version: Ubuntu 20.04
- GCC version: 9.4.0
- CMake version:
Checklist:
- Did you search issues to find if somebody asked this question before?
- If your question is about hang, did you read this doc?
- If your question is about docker, did you read this doc?
- Did you check if your question is answered in the troubleshooting guide?
Bug report:
When I run "pip install horovod" with ROCm 5.0.1 or ROCm 5.1.1, I get the following error:
Stacktrace:
[pip-requirements.txt] Found existing installation: numpy 1. ERROR: Command errored out with exit status 1:
command: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod
cwd: /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/
Complete output (280 lines):
running install
/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-3.8/horovod
creating build/lib.linux-x86_64-3.8/horovod/spark
copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.8/horovod/spark
copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.8/horovod/spark
copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.8/horovod/spark
........ (skip copying)
copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
creating build/lib.linux-x86_64-3.8/horovod/torch/elastic
copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
running build_ext
Running CMake in build/temp.linux-x86_64-3.8/RelWithDebInfo:
cmake /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8 -DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python
cmake --build . --config RelWithDebInfo -- -j8 VERBOSE=1
-- Could not find CCache. Consider installing CCache to speed up compilation.
-- The CXX compiler identification is GNU 9.4.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build architecture flags: -mf16c -mavx -mfma
-- Using command /opt/conda/envs/ptca/bin/python
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - NOTFOUND
-- Looking for a CUDA host compiler - /usr/bin/c++
-- Could not find nvcc, please set CUDAToolkit_ROOT.
-- Could NOT find NVTX (missing: NVTX_INCLUDE_DIR)
-- The C compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Gloo build as STATIC library
-- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
-- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorflow'
-- Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at least version "1.15.0")
-- Found Pytorch: 1.12.0.dev20220505+rocm5.0 (found suitable version "1.12.0.dev20220505+rocm5.0", minimum required is "1.2.0")
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 0
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'mxnet'
-- Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least version "1.4.0")
-- Gloo build as STATIC library
-- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
-- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
-- Configuring done
CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
Cannot find source file:
/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
.hpp .hxx .in .txx
CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
No SOURCES given to target: pytorch
CMake Generate step failed. Build files cannot be regenerated correctly.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 209, in <module>
setup(name='horovod',
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py", line 68, in run
return orig.install.run(self)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 144, in build_extensions
subprocess.check_call(command, cwd=cmake_build_dir)
File "/opt/conda/envs/ptca/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8', '-DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python']' returned non-zero exit status 1.
----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod Check the logs for full command output.
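As the output above shows, CMake fails because it cannot find the source file horovod/torch/ready_event_hip.cc. I tried ROCm 5.0.1 and ROCm 5.1.1, and both failed the same way.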
Can you please take a look?
Thanks
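In case it helps with triage, the relevant failure can be pulled out of the 280-line output with a small script like the one below. This is only a sketch: the helper name `find_missing_sources` and the shortened `sample_log` are illustrative and are not part of Horovod, pip, or CMake. It assumes the log uses the "Cannot find source file:" wording seen above, with the path on the next non-empty line.

```python
# Hypothetical helper (not part of Horovod or CMake): scan build output for
# "Cannot find source file:" errors and collect the path that follows each one.
def find_missing_sources(log_text):
    missing = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Cannot find source file:":
            # The missing path is printed on the next non-empty line.
            for nxt in lines[i + 1:]:
                if nxt.strip():
                    missing.append(nxt.strip())
                    break
    return missing


# Shortened excerpt of the log from this report, used as sample input.
sample_log = """\
-- Configuring done
CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
Cannot find source file:
/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
No SOURCES given to target: pytorch
"""

for path in find_missing_sources(sample_log):
    print(path)
```

Running it on the full pip output should list only ready_event_hip.cc, which is the single file CMake reports as missing in this build.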