Using nccl 2.15.1+cuda11.8 results in test failures in multiple tests #3779
Closed
@Tixxx

Description

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet)
  2. Framework version: tf and pytorch nightly
  3. Horovod version:
  4. MPI version:
  5. CUDA version: 11.8
  6. NCCL version: 2.15.1+cuda11.8
  7. Python version:
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version:
  11. GCC version:
  12. CMake version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
When using NCCL 2.15.1+cuda11.8, some tests fail with NCCL errors. One example is shown below (output from ranks [0] and [1] is interleaved):

[0]:___________________ ComputeWorkerTest.test_single_dispatcher ___________________

  | [0]:
  | [1]:test_compute_worker.py:73: in do_test_worker
  | [0]:self = <test_compute_worker.ComputeWorkerTest testMethod=test_single_dispatcher>
  | [1]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
  | [0]:
  | [1]:test_compute_worker.py:91: in do_test_worker_compute_side
  | [0]: def test_single_dispatcher(self):
  | [1]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
  | [0]:> self.do_test_worker(1, reuse_dataset=False, round_robin=False)
  | [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
  | [0]:
  | [1]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
  | [0]:test_compute_worker.py:53:
  | [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
  | [0]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [1]: return MPI_LIB.horovod_allgather(tensor, name=name,
  | [0]:test_compute_worker.py:73: in do_test_worker
  | [1]::357: in horovod_allgather
  | [0]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
  | [1]: ???
  | [0]:test_compute_worker.py:91: in do_test_worker_compute_side
  | [1]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [0]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
  | [1]:
  | [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
  | [1]:e = _NotOkStatusException(), name = 'test_start.sz'
  | [0]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
  | [1]:
  | [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
  | [1]: def raise_from_not_ok_status(e, name):
  | [0]: return MPI_LIB.horovod_allgather(tensor, name=name,
  | [1]: e.message += (" name: " + name if name is not None else "")
  | [0]::357: in horovod_allgather
  | [1]:> raise core.status_to_exception(e) from None # pylint: disable=protected-access
  | [0]: ???
  | [1]:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node _wrapped__HorovodAllgather_device/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz
  | [0]:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [1]:
  | [0]:
  | [1]:/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py:7252: UnknownError
  | [0]:e = _NotOkStatusException(), name = 'test_start.sz'

There are also other errors related to allreduce.
