Description
Environment:
- Framework: (TensorFlow, Keras, PyTorch, MXNet)
- Framework version: TensorFlow and PyTorch nightly builds
- Horovod version:
- MPI version:
- CUDA version: 11.8
- NCCL version: 2.15.1+cuda11.8
- Python version:
- Spark / PySpark version:
- Ray version:
- OS and version:
- GCC version:
- CMake version:
Checklist:
- Did you search issues to find if somebody asked this question before?
- If your question is about hang, did you read this doc?
- If your question is about docker, did you read this doc?
- Did you check if your question is answered in the troubleshooting guide?
Bug report:
When using NCCL 2.15.1+cuda11.8, some tests fail with NCCL errors. One example is shown below (output from ranks 0 and 1 is interleaved):
[0]:___________________ ComputeWorkerTest.test_single_dispatcher ___________________
| [0]:
| [1]:test_compute_worker.py:73: in do_test_worker
| [0]:self = <test_compute_worker.ComputeWorkerTest testMethod=test_single_dispatcher>
| [1]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
| [0]:
| [1]:test_compute_worker.py:91: in do_test_worker_compute_side
| [0]: def test_single_dispatcher(self):
| [1]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
| [0]:> self.do_test_worker(1, reuse_dataset=False, round_robin=False)
| [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
| [0]:
| [1]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
| [0]:test_compute_worker.py:53:
| [1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
| [0]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [1]: return MPI_LIB.horovod_allgather(tensor, name=name,
| [0]:test_compute_worker.py:73: in do_test_worker
| [1]::357: in horovod_allgather
| [0]: self.do_test_worker_compute_side(dispatchers, processing_mode=processing_mode, reuse_dataset=reuse_dataset, round_robin=round_robin)
| [1]: ???
| [0]:test_compute_worker.py:91: in do_test_worker_compute_side
| [1]:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [0]: cluster_shape = hvd.allgather_object((self.rank, self.size), name='test_start')
| [1]:
| [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/functions.py:212: in allgather_object
| [1]:e = _NotOkStatusException(), name = 'test_start.sz'
| [0]: sizes = to_numpy(allgather(sz, name=name + '.sz', process_set=process_set))
| [1]:
| [0]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py:222: in allgather
| [1]: def raise_from_not_ok_status(e, name):
| [0]: return MPI_LIB.horovod_allgather(tensor, name=name,
| [1]: e.message += (" name: " + name if name is not None else "")
| [0]::357: in horovod_allgather
| [1]:> raise core.status_to_exception(e) from None # pylint: disable=protected-access
| [0]: ???
| [1]:E tensorflow.python.framework.errors_impl.UnknownError: {{function_node _wrapped__HorovodAllgather_device/job:localhost/replica:0/task:0/device:GPU:0}} ncclAllGather failed: invalid argument [Op:HorovodAllgather] name: test_start.sz
| [0]: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| [1]:
| [0]:
| [1]:/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py:7252: UnknownError
| [0]:e = _NotOkStatusException(), name = 'test_start.sz'
Other tests also fail with errors related to allreduce.
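
Because the traceback interleaves lines from two ranks, it can be easier to read after grouping lines by their `[rank]:` prefix. The following is a minimal sketch for doing that, not part of Horovod: the helper name `split_by_rank` and the regular expression are assumptions based only on the line format seen in the log in this report (an optional leading `| `, then `[N]:`).

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log in this report:
# an optional leading "| ", then a "[rank]:" tag, then the message.
RANK_PREFIX = re.compile(r"^\|?\s*\[(\d+)\]:(.*)$")


def split_by_rank(log_text):
    """Group interleaved log lines by rank, preserving per-rank order.

    Hypothetical helper for reading the log; lines without a
    "[rank]:" tag are skipped.
    """
    per_rank = defaultdict(list)
    for line in log_text.splitlines():
        match = RANK_PREFIX.match(line)
        if match:
            per_rank[int(match.group(1))].append(match.group(2))
    return dict(per_rank)


# A few lines copied from the traceback in this report.
sample = """\
| [1]:test_compute_worker.py:73: in do_test_worker
| [0]:self = <test_compute_worker.ComputeWorkerTest testMethod=test_single_dispatcher>
| [0]:
| [1]:test_compute_worker.py:91: in do_test_worker_compute_side
"""

for rank, lines in sorted(split_by_rank(sample).items()):
    print(f"--- rank {rank} ---")
    print("\n".join(lines))
```

Running this over the full log yields one contiguous traceback per rank, which makes it clearer that rank 1 is the one reporting `ncclAllGather failed: invalid argument` in the excerpt shown.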