Add collective latency profiler by ycui1984 · Pull Request #1785 · ROCm/rccl

Add collective latency profiler #1785


Open · wants to merge 6 commits into develop

Conversation

@ycui1984 commented Jul 2, 2025


This feature profiles collective information such as collective type, latency, variance, message size, etc. The purpose of this profiler is two-fold.

  1. It tells you which collectives are the most expensive after running an end-to-end workload, which helps guide the optimization direction, e.g., whether to optimize AllReduce or AllGather, which message sizes we should pay attention to, etc.
  2. It profiles the variance across different ranks and provides an aggregated view. Sometimes we want to confirm that a collective is running well across all ranks.

What were the changes?

  1. Added
    src/include/latency_profiler/CollTrace.h
    src/include/latency_profiler/CollTraceEvent.h
    src/include/latency_profiler/CollTraceFunc.h
    src/include/latency_profiler/CollTraceUtils.h
    src/include/latency_profiler/EventQueue.h
    src/misc/latency_profiler/CollTrace.cc
    src/misc/latency_profiler/CollTraceEvent.cc
    src/misc/latency_profiler/CollTraceFunc.cc
    src/misc/latency_profiler/CollTraceUtils.cc
    test/latency_profiler/LatencyProfilerUnitTest.cpp
    CMakeLists.txt

  2. Modified
    src/enqueue.cc
    src/include/comm.h
    src/init.cc
    test/CMakeLists.txt

Implementation steps

  1. Measure collective latency by recording CUDA/HIP events before and after each kernel launch in RCCL (see the sketch after this list).
  2. Start a separate worker thread that collects latency data into its local ring buffer.
  3. Perform a CPU-level all-gather to exchange latency data between threads (200 to 300 us per exchange of 100 latency entries).
  4. Report when the results buffer is full, or every few minutes.
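
Below is a minimal sketch of the event-based timing idea in steps 1 and 2, using the public HIP runtime API. The names LatencySample, gSampleQueue, and timeCollective are illustrative only, not the actual CollTrace/EventQueue interfaces added by this PR; the real profiler hands the events to the worker thread rather than waiting inline.

// Sketch only: HIP-event timing around a collective kernel launch (steps 1-2).
// LatencySample / gSampleQueue / timeCollective are hypothetical names.
#include <hip/hip_runtime.h>
#include <cstdint>
#include <deque>
#include <mutex>

struct LatencySample {
  uint64_t collId;
  float latencyUs;
};

static std::mutex gQueueMutex;
static std::deque<LatencySample> gSampleQueue;  // drained by a worker thread

void timeCollective(uint64_t collId, hipStream_t stream,
                    void (*launchKernel)(hipStream_t)) {
  hipEvent_t start, stop;
  (void)hipEventCreate(&start);
  (void)hipEventCreate(&stop);

  (void)hipEventRecord(start, stream);  // event before the kernel launch
  launchKernel(stream);                 // the collective kernel
  (void)hipEventRecord(stop, stream);   // event after the kernel launch

  // The real profiler lets the worker thread wait on the events so the
  // enqueue path stays non-blocking; this sketch waits inline for brevity.
  (void)hipEventSynchronize(stop);
  float ms = 0.0f;
  (void)hipEventElapsedTime(&ms, start, stop);

  {
    std::lock_guard<std::mutex> lock(gQueueMutex);
    gSampleQueue.push_back({collId, ms * 1000.0f});  // store microseconds
  }

  (void)hipEventDestroy(start);
  (void)hipEventDestroy(stop);
}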

Why were the changes made?
This work was previously done by manually analyzing traces, which makes it hard to capture the whole end-to-end behavior, and analyzing multiple GPU traces to get execution stats for a single collective is not scalable.

How was the outcome achieved?

  1. Unit tests
    ./rccl-UnitTests --gtest_filter=CollTraceUtilsTest*
    ================================================================================
    Environment variables:
  • UT_DEBUG_PAUSE Pause for debugger attach ( 0)
  • UT_SHOW_NAMES Show test case names ( 1)
  • UT_MIN_GPUS Minimum number of GPUs to use ( 1)
  • UT_MAX_GPUS Maximum number of GPUs to use ( 8)
  • UT_POW2_GPUS Only allow power-of-2 # of GPUs ( 0)
  • UT_PROCESS_MASK Whether to run single/multi process ( 3)
  • UT_VERBOSE Show verbose unit test output ( 0)
  • UT_REDOPS List of reduction ops to test ( -1)
  • UT_DATATYPES List of datatypes to test ( -1)
  • UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU ( 1)
  • UT_PRINT_VALUES Print array values (-1 for all) ( 0)
  • UT_SHOW_TIMING Show timing table ( 1)
  • UT_INTERACTIVE Run in interactive mode ( 0)
  • UT_TIMEOUT_US Timeout limit for collective calls in us (5000000)
  • UT_MULTITHREAD Multi-thread single-process ranks ( 0)
    ================================================================================
    Note: Google Test filter = CollTraceUtilsTest*
    [==========] Running 4 tests from 1 test suite.
    [----------] Global test environment set-up.
    [----------] 4 tests from CollTraceUtilsTest
    [ RUN ] CollTraceUtilsTest.aggregateResultsTest
    [ OK ] CollTraceUtilsTest.aggregateResultsTest (0 ms)
    [ RUN ] CollTraceUtilsTest.EventQueueOperationTest
    [ OK ] CollTraceUtilsTest.EventQueueOperationTest (0 ms)
    [ RUN ] CollTraceUtilsTest.EventQueueMultiThreadTest
    [ OK ] CollTraceUtilsTest.EventQueueMultiThreadTest (0 ms)
    [ RUN ] CollTraceUtilsTest.getSizeMbTest
    [ OK ] CollTraceUtilsTest.getSizeMbTest (0 ms)
    [----------] 4 tests from CollTraceUtilsTest (0 ms total)
    [----------] Global test environment tear-down
    [==========] 4 tests from 1 test suite ran. (0 ms total)
    [ PASSED ] 4 tests.
    [ INFO ] Total executed cases: 0
    [ TIMING ] TEST SUITE : TEST NAME : TIME ms (STATUS)
    [ TIMING ] CollTraceUtilsTest : aggregateResultsTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : EventQueueOperationTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : EventQueueMultiThreadTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : getSizeMbTest : 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : TOTAL : 0.00 sec (PASS)
    [ TIMING ] Total time: 0.00 minutes
  2. Run the RCCL tests.
    Set RCCL_LATENCY_PROFILER=1 when running the RCCL tests; in the NCCL debug files we can see information like the lines below (a sketch of how these fields can be aggregated follows the sample output).
    Please note that in Meta's internal environment we have internal tools (Scuba) that help with aggregating, filtering, and visualization, but we do not have that in the OSS environment, so we output to files. In the code, we leave a placeholder for the Meta integration.

devgpu039:494282:495674 [0] NCCL INFO coll_id 0, percent 11, min_latency_us 88312.796875, max_latency_us 98300.320312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 1, percent 18, min_latency_us 80149.375000, max_latency_us 94813.242188, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 2, percent 18, min_latency_us 68449.570312, max_latency_us 81227.570312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 3, percent 5, min_latency_us 87649.710938, max_latency_us 92216.945312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 4, percent 14, min_latency_us 88092.070312, max_latency_us 100858.929688, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 5, percent 178, min_latency_us 9818.096680, max_latency_us 27337.013672, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 6, percent 101, min_latency_us 6809.101074, max_latency_us 13728.096680, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 7, percent 152, min_latency_us 5829.539062, max_latency_us 14748.134766, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 8, percent 138, min_latency_us 8587.885742, max_latency_us 20474.212891, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 9, percent 295, min_latency_us 3697.302979, max_latency_us 14604.616211, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
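
For readers without Scuba, the sketch below shows how the min/max/percent fields in the lines above can be reproduced from per-rank latencies after the CPU-level all-gather. The percent formula ((max - min) / min * 100) is an assumption that matches the sample output, not taken from the profiler source; reportSpread is a hypothetical helper.

// Sketch only: aggregate per-rank latencies for one collective into the
// min/max/percent fields seen in the log lines above. The percent formula is
// an assumption inferred from the sample output.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

void reportSpread(uint64_t collId, const std::vector<float>& perRankLatencyUs) {
  const auto mm = std::minmax_element(perRankLatencyUs.begin(),
                                      perRankLatencyUs.end());
  const float minUs = *mm.first;
  const float maxUs = *mm.second;
  const int percent = static_cast<int>((maxUs - minUs) / minUs * 100.0f);
  std::printf("coll_id %llu, percent %d, min_latency_us %f, max_latency_us %f\n",
              static_cast<unsigned long long>(collId), percent, minUs, maxUs);
}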

Additional Documentation:
Shared with the AMD RCCL team privately.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

#include "comm.h"
#include "CollTraceUtils.h"

namespace meta {
Contributor @thananon commented Jul 10, 2025

There is already a namespace facebook in our codebase and this one introduces meta. I am not sure this is a great idea for OSS. @corey-derochie-amd

Is it possible to remove the company-specific namespace and change it to something generic?
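
For illustration, a generic namespace along the lines the comment suggests could look like the sketch below; rccl::latency_profiler is only an example name, not one agreed on in this PR.

// Hypothetical alternative to "namespace meta"; the name is an example only.
#include "comm.h"
#include "CollTraceUtils.h"

namespace rccl {
namespace latency_profiler {
// CollTrace helpers would live here instead of a company-specific namespace.
}  // namespace latency_profiler
}  // namespace rccl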
