Using NCCL 2.15.1+cuda11.8 results in failures in multiple tests

Environment:

Framework: (TensorFlow, Keras, PyTorch, MXNet)
Framework version: TensorFlow and PyTorch nightly
Horovod version:
MPI version:
CUDA version: 11.8
NCCL version: 2.15.1+cuda11.8
Python version:
Spark / PySpark version:
Ray version:
OS and version:
GCC version:
CMake version:

Checklist:

Did you search issues to find if somebody asked this question before?
If your question is about hang, did you read this doc?
If your question is about docker, did you read this doc?
Did you check if your question is answered in the troubleshooting guide?

Bug report:
When using NCCL 2.15.1+cuda11.8, some tests fail with NCCL errors. One example is shown below; the output of ranks [0] and [1] is interleaved:

[0]:___________________ ComputeWorkerTest.test_single_dispatcher ___________________

| [0]:
| [1]:test_compute_worker.py:73: in do_test_worker
| [0]:self = <test_compute_worker.ComputeWorkerTest testMethod=test_single_dispatcher>
| [1]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
| [0]:
| [1]:test_compute_worker.py:91: in do_test_worker_compute_side
| [0]: def test_single_dispatcher(self):
| [1]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
| [0]:> self.do_test_worker(1, reuse_dataset=False, round_robin=False)
| [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
| [0]:
| [1]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
| [0]:test_compute_worker.py:53:
| [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
| [0]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [1]: return MPI_LIB.horovod_allgather(tensor, name=name,
| [0]:test_compute_worker.py:73: in do_test_worker
| [1]::357: in horovod_allgather
| [0]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
| [1]: ???
| [0]:test_compute_worker.py:91: in do_test_worker_compute_side
| [1]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [0]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
| [1]:
| [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
| [1]:e = _NotOkStatusException(), name = 'test_start.sz'
| [0]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
| [1]:
| [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
| [1]: def raise_from_not_ok_status(e, name):
| [0]: return MPI_LIB.horovod_allgather(tensor, name=name,
| [1]: e.message += (" name: " + name if name is not None else "")
| [0]::357: in horovod_allgather
| [1]:> raise core.status_to_exception(e) from None # pylint: disable=protected-access
| [0]: ???
| [1]:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node _wrapped__HorovodAllgather_device/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz
| [0]: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [1]:
| [0]:
| [1]:/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py:7252: UnknownError
| [0]:e = _NotOkStatusException(), name = 'test_start.sz'

There are also other errors related to allreduce.
