Tags · chr1sj0nes/nccl · GitHub
Tags

v2.11.4-1


Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow the user to choose among multiple NCCL net plugins by substituting the suffix into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
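The PreMulSum reduction introduced above can be illustrated outside of NCCL. Below is a minimal Python sketch of the semantics only (each rank scales its local input by a rank-specific scalar before the inter-rank summation); the function name and data layout are hypothetical, not an NCCL binding:

```python
# Sketch of the PreMulSum reduction semantics (not an NCCL binding):
# each rank multiplies its input elementwise by its own scalar,
# then the scaled inputs are summed elementwise across ranks.
def premulsum(inputs_per_rank, scalars_per_rank):
    scaled = [
        [x * s for x in rank_input]
        for rank_input, s in zip(inputs_per_rank, scalars_per_rank)
    ]
    # Inter-rank summation, elementwise across ranks.
    return [sum(col) for col in zip(*scaled)]

# Two ranks, each holding a 3-element buffer, with per-rank scalars.
result = premulsum([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0.5, 2.0])
# result == [8.5, 11.0, 13.5]
```

In the actual API, the rank-specific scalar is attached when the custom reduction operation is created, and the operation is then passed to a collective like ncclAllReduce in place of a built-in op.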

v2.10.3-1


Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for the tree algorithm.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix affinity of proxy memory elements (improves alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
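The ncclAvg operation added in this release reduces like a sum divided by the number of ranks. A small Python sketch of that semantics (the function name is hypothetical, not an NCCL binding):

```python
# Sketch of the ncclAvg reduction semantics: elementwise sum across
# ranks, divided by the rank count (not an NCCL binding).
def allreduce_avg(inputs_per_rank):
    nranks = len(inputs_per_rank)
    return [sum(col) / nranks for col in zip(*inputs_per_rank)]

# Two ranks, each holding a 2-element buffer.
result = allreduce_avg([[2.0, 4.0], [6.0, 8.0]])
# result == [4.0, 6.0]
```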

v2.9.9-1


Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue NVIDIA#505)

v2.9.8-1


Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

v2.9.6-1


Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue NVIDIA#439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

v2.8.4-1


Fix hang in corner cases of alltoallv using point-to-point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

v2.8.3-1


Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix NVIDIA#379 : topology injection failing when using less GPUs than
described in the XML.
Fix NVIDIA#394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

v2.7.8-1


Fix collective mismatch error when using ncclSend/ncclRecv.

v2.7.6-1


Fix crash when NVswitch is not visible inside a VM.

v2.7.5-1


Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.