Can't pip install horovod for rocm 5.0+ · Issue #3537 · horovod/horovod · GitHub
Can't pip install horovod for rocm 5.0+ #3537
Closed
@xiaoyu-work

Description

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.12.0.dev
  3. Horovod version: 0.24.2
  4. MPI version: 3.1
  5. ROCm version: 5.0+ (tried 5.0.1 and 5.1.1)
  6. NCCL version:
  7. Python version: 3.8
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 20.04
  11. GCC version:
  12. CMake version:
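
Several of the fields above (GCC, CMake) were left blank. A minimal sketch for collecting them in one pass is below. It assumes `gcc` and `cmake` are on `PATH` and that the installed PyTorch exposes `torch.version.hip`, which is `None` on non-ROCm builds. The helper names `tool_version` and `collect_env` are hypothetical and not part of Horovod.

```python
import platform
import shutil
import subprocess


def tool_version(name):
    """Return the first line of `<name> --version`, or None if unavailable."""
    path = shutil.which(name)
    if path is None:
        return None
    try:
        out = subprocess.run([path, "--version"], capture_output=True,
                             text=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.splitlines()
    return lines[0].strip() if lines else None


def collect_env():
    """Collect the version fields the issue template asks for."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "gcc": tool_version("gcc"),
        "cmake": tool_version("cmake"),
        "torch": None,
        "hip": None,
    }
    try:
        import torch  # optional: fields stay None if PyTorch is missing
        info["torch"] = torch.__version__
        # Assumption: ROCm builds report the HIP version here.
        info["hip"] = getattr(torch.version, "hip", None)
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Running the script inside the same conda environment used for the install (here `/opt/conda/envs/ptca`) should report the versions that the build actually sees.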

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
When I run "pip install horovod" with ROCm 5.0.1 or ROCm 5.1.1, the build fails with the following error:

Stacktrace:

[pip-requirements.txt]     Found existing installation: numpy 1.
    ERROR: Command errored out with exit status 1:
     command: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod
         cwd: /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/
    Complete output (280 lines):
    running install
    /opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.8
    creating build/lib.linux-x86_64-3.8/horovod
    copying horovod/__init__.py -> build/lib.linux-x86_64-3.8/horovod
    creating build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.8/horovod/spark
........ (skip copying)
    copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    creating build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
    copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
    creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
    copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
    running build_ext
    Running CMake in build/temp.linux-x86_64-3.8/RelWithDebInfo:
    cmake /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8 -DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python
    cmake --build . --config RelWithDebInfo -- -j8 VERBOSE=1

    -- Could not find CCache. Consider installing CCache to speed up compilation.
    -- The CXX compiler identification is GNU 9.4.0
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Build architecture flags: -mf16c -mavx -mfma
    -- Using command /opt/conda/envs/ptca/bin/python
    -- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- Looking for a CUDA compiler
    -- Looking for a CUDA compiler - NOTFOUND
    -- Looking for a CUDA host compiler - /usr/bin/c++
    -- Could not find nvcc, please set CUDAToolkit_ROOT.
    -- Could NOT find NVTX (missing: NVTX_INCLUDE_DIR)
    -- The C compiler identification is GNU 9.4.0
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Gloo build as STATIC library
    -- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
    -- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'tensorflow'
    -- Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at least version "1.15.0")
    -- Found Pytorch: 1.12.0.dev20220505+rocm5.0 (found suitable version "1.12.0.dev20220505+rocm5.0", minimum required is "1.2.0")
    Successfully preprocessed all matching files.
    Total number of unsupported CUDA function calls: 0
    
    Total number of replaced kernel launches: 0
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'mxnet'
    -- Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least version "1.4.0")
    -- Gloo build as STATIC library
    -- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
    -- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
    -- Configuring done
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      Cannot find source file:
    
        /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc
    
      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
      .hpp .hxx .in .txx
    
    
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      No SOURCES given to target: pytorch
   
    CMake Generate step failed.  Build files cannot be regenerated correctly.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 209, in <module>
        setup(name='horovod',
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py", line 68, in run
        return orig.install.run(self)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 144, in build_extensions
        subprocess.check_call(command, cwd=cmake_build_dir)
      File "/opt/conda/envs/ptca/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8', '-DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod Check the logs for full command output.
22.3

The error is shown above. I tried ROCm 5.0.1 and ROCm 5.1.1, and both fail in the same way.

Can you please take a look?

Thanks
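
The decisive lines in the 280-line output are the two `CMake Error` entries, which are easy to miss among the copy steps. A minimal sketch for pulling the missing source paths out of a saved log is below. It assumes the message layout shown in the stack trace above (`Cannot find source file:` followed by the path on the next non-empty line). `missing_sources` is a hypothetical helper, not part of Horovod or CMake.

```python
import re

# "Cannot find source file:" followed by whitespace/newlines, then the path.
_MISSING = re.compile(r"Cannot find source file:\s+(\S+)")


def missing_sources(log_text):
    """Return every path CMake reported as a missing source file."""
    return _MISSING.findall(log_text)


# Excerpt copied from the build log in this report.
SAMPLE = """\
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      Cannot find source file:

        /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc

      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
"""

if __name__ == "__main__":
    for path in missing_sources(SAMPLE):
        print(path)
```

For this report the only hit is `ready_event_hip.cc`, the file that the `add_library` call at `horovod/torch/CMakeLists.txt:81` expects to find.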
