Add collective latency profiler #1785
Open
+856
−3
Details
This feature profiles collective information such as collective type, latency, variance, and message size. The purpose of this profiler is twofold.
What were the changes?
Added
src/include/latency_profiler/CollTrace.h
src/include/latency_profiler/CollTraceEvent.h
src/include/latency_profiler/CollTraceFunc.h
src/include/latency_profiler/CollTraceUtils.h
src/include/latency_profiler/EventQueue.h
src/misc/latency_profiler/CollTrace.cc
src/misc/latency_profiler/CollTraceEvent.cc
src/misc/latency_profiler/CollTraceFunc.cc
src/misc/latency_profiler/CollTraceUtils.cc
test/latency_profiler/LatencyProfilerUnitTest.cpp
CMakeLists.txt
Modified
src/enqueue.cc
src/include/comm.h
src/init.cc
test/CMakeLists.txt
Implementation steps
Why were the changes made?
Previously, this work was done by manually analyzing traces, which makes it hard to capture the whole end-to-end behavior. Analyzing traces from multiple GPUs to get the stats for a single collective execution is also not scalable.
How was the outcome achieved?
./rccl-UnitTests --gtest_filter=CollTraceUtilsTest*
================================================================================
Environment variables:
================================================================================
Note: Google Test filter = CollTraceUtilsTest*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from CollTraceUtilsTest
[ RUN ] CollTraceUtilsTest.aggregateResultsTest
[ OK ] CollTraceUtilsTest.aggregateResultsTest (0 ms)
[ RUN ] CollTraceUtilsTest.EventQueueOperationTest
[ OK ] CollTraceUtilsTest.EventQueueOperationTest (0 ms)
[ RUN ] CollTraceUtilsTest.EventQueueMultiThreadTest
[ OK ] CollTraceUtilsTest.EventQueueMultiThreadTest (0 ms)
[ RUN ] CollTraceUtilsTest.getSizeMbTest
[ OK ] CollTraceUtilsTest.getSizeMbTest (0 ms)
[----------] 4 tests from CollTraceUtilsTest (0 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (0 ms total)
[ PASSED ] 4 tests.
[ INFO ] Total executed cases: 0
[ TIMING ] TEST SUITE : TEST NAME : TIME ms (STATUS)
[ TIMING ] CollTraceUtilsTest : aggregateResultsTest: 0.00 sec (PASS)
[ TIMING ] CollTraceUtilsTest : EventQueueOperationTest: 0.00 sec (PASS)
[ TIMING ] CollTraceUtilsTest : EventQueueMultiThreadTest: 0.00 sec (PASS)
[ TIMING ] CollTraceUtilsTest : getSizeMbTest : 0.00 sec (PASS)
[ TIMING ] CollTraceUtilsTest : TOTAL : 0.00 sec (PASS)
[ TIMING ] Total time: 0.00 minutes
Set RCCL_LATENCY_PROFILER=1 when running RCCL tests; the NCCL debug files will then contain entries like the ones below.
Please note that in Meta's internal environment, internal tools (Scuba) handle aggregation, filtering, and visualization. Those tools are not available in the OSS environment, so the profiler writes its output to files. The code leaves a placeholder for the Meta-internal integration.
devgpu039:494282:495674 [0] NCCL INFO coll_id 0, percent 11, min_latency_us 88312.796875, max_latency_us 98300.320312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 1, percent 18, min_latency_us 80149.375000, max_latency_us 94813.242188, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 2, percent 18, min_latency_us 68449.570312, max_latency_us 81227.570312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 3, percent 5, min_latency_us 87649.710938, max_latency_us 92216.945312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 4, percent 14, min_latency_us 88092.070312, max_latency_us 100858.929688, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 5, percent 178, min_latency_us 9818.096680, max_latency_us 27337.013672, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 6, percent 101, min_latency_us 6809.101074, max_latency_us 13728.096680, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 7, percent 152, min_latency_us 5829.539062, max_latency_us 14748.134766, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 8, percent 138, min_latency_us 8587.885742, max_latency_us 20474.212891, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 9, percent 295, min_latency_us 3697.302979, max_latency_us 14604.616211, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
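For reading the sample output above, here is a minimal post-processing sketch. It is not code from this PR: `fieldValue`, `spreadPercent`, and `messageSizeMb` are hypothetical helper names. The `percent` formula is an assumption inferred from the sample lines, where `percent` matches `(max_latency_us - min_latency_us) / min_latency_us * 100` truncated to an integer (for example 11 for coll_id 0 and 178 for coll_id 5). The size formula assumes `message_size_MB` is `count` times the element size in bytes divided by 1024 * 1024, which matches 1024.000000 for 268435456 ncclFloat32 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the PR): returns the text that follows
// "<key> " in a profiler log line, up to the next comma or the end of the line.
// Returns an empty string when the key is not present.
std::string fieldValue(const std::string& line, const std::string& key) {
    const std::string needle = key + " ";
    std::size_t pos = line.find(needle);
    if (pos == std::string::npos) {
        return "";
    }
    pos += needle.size();
    const std::size_t end = line.find(',', pos);
    return line.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}

// Assumed definition of the "percent" field, inferred from the sample output:
// latency spread relative to the minimum latency, truncated to an integer.
int spreadPercent(double minLatencyUs, double maxLatencyUs) {
    assert(minLatencyUs > 0.0 && maxLatencyUs >= minLatencyUs);
    return static_cast<int>((maxLatencyUs - minLatencyUs) / minLatencyUs * 100.0);
}

// Assumed definition of "message_size_MB": element count times element size
// in bytes, expressed in units of 1024 * 1024 bytes.
double messageSizeMb(std::uint64_t count, std::size_t bytesPerElement) {
    return static_cast<double>(count) * static_cast<double>(bytesPerElement) /
           (1024.0 * 1024.0);
}
```

A consumer could call `fieldValue(line, "min_latency_us")` and `fieldValue(line, "max_latency_us")`, convert both with `std::stod`, and pass them to `spreadPercent` to cross-check the logged `percent` value. If the profiler actually defines these fields differently, adjust the helpers to match.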
Additional Documentation:
Shared with AMD rccl team privately
Approval Checklist
Do not approve until these items are satisfied.