Tags · chr1sj0nes/nccl

v2.11.4-1

2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to let users choose among multiple NCCL net plugins; the suffix is substituted into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
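
The pre-multiplied sum described above can be modelled in a few lines of plain Python. This is only a sketch of the arithmetic, not the NCCL API: the function name premul_sum_allreduce and its list-based arguments are hypothetical, and the real ncclRedOpCreatePreMulSum signature is not shown in these notes. Each rank's buffer is scaled by that rank's scalar, and the scaled buffers are then summed element by element.

```python
from typing import List, Sequence


def premul_sum_allreduce(
    inputs: Sequence[Sequence[float]], scalars: Sequence[float]
) -> List[float]:
    """Model of a pre-multiplied sum (hypothetical helper, not an NCCL call).

    inputs[r] is the buffer contributed by rank r and scalars[r] is the
    rank-specific scalar applied to it before the inter-rank summation.
    """
    if not inputs:
        raise ValueError("need at least one rank")
    if len(inputs) != len(scalars):
        raise ValueError("need exactly one scalar per rank")
    count = len(inputs[0])
    if any(len(buf) != count for buf in inputs):
        raise ValueError("all ranks must contribute the same element count")

    result = [0.0] * count
    for buf, scalar in zip(inputs, scalars):
        for i, value in enumerate(buf):
            # Multiply first, then accumulate across ranks.
            result[i] += scalar * value
    return result


if __name__ == "__main__":
    ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    # Using 1/nranks on every rank turns the sum into an average.
    print(premul_sum_allreduce(ranks, [0.25] * len(ranks)))
```

With the same scalar 1/nranks on every rank the result is the element-wise mean; different scalars per rank give a weighted sum.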

Sep 8, 2021
e11238b

v2.10.3-1

2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance of the tree algorithm.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
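
For readers unfamiliar with bfloat16: the format keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits of an IEEE float32, so it occupies the upper 16 bits of the float32 encoding. The sketch below shows that relationship in plain Python. The helper names are hypothetical, and the round-to-nearest-even step is an assumption made for illustration; these notes do not say how NCCL or CUDA round when converting.

```python
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit bfloat16 pattern for value (hypothetical helper).

    Assumes round-to-nearest-even. struct.pack raises OverflowError for
    finite values that do not fit in a float32.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF and mantissa != 0:
        # NaN: keep the upper bits and force a mantissa bit so it stays NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add half of the discarded range, plus one if the kept LSB is odd.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen a bfloat16 pattern by placing it in the top half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for x in (1.0, -2.0, 3.14159):
        b = float32_to_bfloat16_bits(x)
        print(f"{x!r} -> 0x{b:04X} -> {bfloat16_bits_to_float32(b)!r}")
```

The round trip shows the cost of the shorter mantissa: 3.14159 comes back as 3.140625.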

Jul 8, 2021
7e51592

v2.9.9-1

2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue NVIDIA#505)

May 12, 2021
3fec2fa

v2.9.8-1

2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix CollNet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

May 10, 2021
ca8485b

v2.9.6-1

2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue NVIDIA#439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

Apr 12, 2021
a46ea10

v2.8.4-1

2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

Feb 9, 2021
911d61f

v2.8.3-1

2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all channels.
Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node communication.
Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer.
Net: move to v4 of the network plugin interface.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix NVIDIA#379: topology injection failing when using fewer GPUs than described in the XML.
Fix NVIDIA#394: protocol mismatch causing hangs or crashes when using one GPU per node.
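
Several entries above concern alltoall built from point-to-point send/recv. As a reminder of what that pattern computes, here is a minimal single-process model in plain Python. It is a sketch only: alltoall_model is a hypothetical name, nested lists stand in for per-rank buffers, and nothing here reflects how NCCL schedules or transports the data.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def alltoall_model(send_buffers: Sequence[Sequence[T]]) -> List[List[T]]:
    """Model alltoall as one send/recv pair per (source, destination).

    send_buffers[src][dst] is the chunk rank src sends to rank dst.
    The returned recv[dst][src] is the chunk rank dst received from src.
    """
    nranks = len(send_buffers)
    if any(len(buf) != nranks for buf in send_buffers):
        raise ValueError("each rank must provide one chunk per peer")

    recv: List[List[T]] = [[] for _ in range(nranks)]
    for dst in range(nranks):
        for src in range(nranks):
            # Conceptually: rank src sends to dst, rank dst receives from src.
            recv[dst].append(send_buffers[src][dst])
    return recv


if __name__ == "__main__":
    sends = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
    for rank, chunks in enumerate(alltoall_model(sends)):
        print(rank, chunks)
```

The result is the transpose of the send matrix, which is why every rank ends up exchanging data with every other rank.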

Nov 17, 2020
920dbe5

v2.7.8-1

2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv.

Jul 27, 2020
033d799

v2.7.6-1

2.7.6-1

Fix crash when NVSwitch is not visible inside a VM.

Jun 26, 2020
1952325

v2.7.5-1

2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for an invalid GroupEnd call.

Jun 26, 2020
01afd20
