Add collective latency profiler by ycui1984 · Pull Request #1785 · ROCm/rccl

Add collective latency profiler #1785


Open · wants to merge 6 commits into develop

Conversation

@ycui1984 commented Jul 2, 2025


This feature profiles collective information such as collective type, latency, variance, message size, etc. The purpose of this profiler is two-fold.

  1. It tells you which collectives are the most expensive after running an end-to-end workload, which helps guide the optimization direction, e.g., whether to optimize AllReduce or AllGather, which message sizes we should pay attention to, etc.
  2. It profiles the variance across different ranks and provides an aggregated view. Sometimes we want to confirm that a collective is running well across all ranks.

What were the changes?

  1. Added
    src/include/latency_profiler/CollTrace.h
    src/include/latency_profiler/CollTraceEvent.h
    src/include/latency_profiler/CollTraceFunc.h
    src/include/latency_profiler/CollTraceUtils.h
    src/include/latency_profiler/EventQueue.h
    src/misc/latency_profiler/CollTrace.cc
    src/misc/latency_profiler/CollTraceEvent.cc
    src/misc/latency_profiler/CollTraceFunc.cc
    src/misc/latency_profiler/CollTraceUtils.cc
    test/latency_profiler/LatencyProfilerUnitTest.cpp
    CMakeLists.txt

  2. Modified
    src/enqueue.cc
    src/include/comm.h
    src/init.cc
    test/CMakeLists.txt

Implementation steps

  1. Measure collective latency by recording CUDA/HIP events before and after each kernel launch in RCCL (see the sketch after this list).
  2. Start a separate worker thread that collects latency data into its local ring buffer.
  3. Perform a CPU-level all-gather to exchange latency data between threads (200 to 300 us per exchange of 100 latency entries).
  4. Report when the results buffer is full, or every few minutes.
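
Below is a minimal sketch of the event-based timing idea in steps 1 and 2, using the public HIP runtime API. The names LatencySample, gSampleQueue, and timeCollective are illustrative only, not the actual CollTrace/EventQueue interfaces added by this PR; the real profiler hands the events to the worker thread rather than waiting inline.

// Sketch only: HIP-event timing around a collective kernel launch (steps 1-2).
// LatencySample / gSampleQueue / timeCollective are hypothetical names.
#include <hip/hip_runtime.h>
#include <cstdint>
#include <deque>
#include <mutex>

struct LatencySample {
  uint64_t collId;
  float latencyUs;
};

static std::mutex gQueueMutex;
static std::deque<LatencySample> gSampleQueue;  // drained by a worker thread

void timeCollective(uint64_t collId, hipStream_t stream,
                    void (*launchKernel)(hipStream_t)) {
  hipEvent_t start, stop;
  (void)hipEventCreate(&start);
  (void)hipEventCreate(&stop);

  (void)hipEventRecord(start, stream);  // event before the kernel launch
  launchKernel(stream);                 // the collective kernel
  (void)hipEventRecord(stop, stream);   // event after the kernel launch

  // The real profiler lets the worker thread wait on the events so the
  // enqueue path stays non-blocking; this sketch waits inline for brevity.
  (void)hipEventSynchronize(stop);
  float ms = 0.0f;
  (void)hipEventElapsedTime(&ms, start, stop);

  {
    std::lock_guard<std::mutex> lock(gQueueMutex);
    gSampleQueue.push_back({collId, ms * 1000.0f});  // store microseconds
  }

  (void)hipEventDestroy(start);
  (void)hipEventDestroy(stop);
}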

Why were the changes made?
This work was previously done by manually analyzing traces, which makes it hard to capture the whole end-to-end behavior, and analyzing multiple GPU traces to get execution stats for a single collective is not scalable.

How was the outcome achieved?

  1. Unit tests
    ./rccl-UnitTests --gtest_filter=CollTraceUtilsTest*
    ================================================================================
    Environment variables:
  • UT_DEBUG_PAUSE Pause for debugger attach ( 0)
  • UT_SHOW_NAMES Show test case names ( 1)
  • UT_MIN_GPUS Minimum number of GPUs to use ( 1)
  • UT_MAX_GPUS Maximum number of GPUs to use ( 8)
  • UT_POW2_GPUS Only allow power-of-2 # of GPUs ( 0)
  • UT_PROCESS_MASK Whether to run single/multi process ( 3)
  • UT_VERBOSE Show verbose unit test output ( 0)
  • UT_REDOPS List of reduction ops to test ( -1)
  • UT_DATATYPES List of datatypes to test ( -1)
  • UT_MAX_RANKS_PER_GPU Maximum number of ranks using the same GPU ( 1)
  • UT_PRINT_VALUES Print array values (-1 for all) ( 0)
  • UT_SHOW_TIMING Show timing table ( 1)
  • UT_INTERACTIVE Run in interactive mode ( 0)
  • UT_TIMEOUT_US Timeout limit for collective calls in us (5000000)
  • UT_MULTITHREAD Multi-thread single-process ranks ( 0)
    ================================================================================
    Note: Google Test filter = CollTraceUtilsTest*
    [==========] Running 4 tests from 1 test suite.
    [----------] Global test environment set-up.
    [----------] 4 tests from CollTraceUtilsTest
    [ RUN ] CollTraceUtilsTest.aggregateResultsTest
    [ OK ] CollTraceUtilsTest.aggregateResultsTest (0 ms)
    [ RUN ] CollTraceUtilsTest.EventQueueOperationTest
    [ OK ] CollTraceUtilsTest.EventQueueOperationTest (0 ms)
    [ RUN ] CollTraceUtilsTest.EventQueueMultiThreadTest
    [ OK ] CollTraceUtilsTest.EventQueueMultiThreadTest (0 ms)
    [ RUN ] CollTraceUtilsTest.getSizeMbTest
    [ OK ] CollTraceUtilsTest.getSizeMbTest (0 ms)
    [----------] 4 tests from CollTraceUtilsTest (0 ms total)
    [----------] Global test environment tear-down
    [==========] 4 tests from 1 test suite ran. (0 ms total)
    [ PASSED ] 4 tests.
    [ INFO ] Total executed cases: 0
    [ TIMING ] TEST SUITE : TEST NAME : TIME ms (STATUS)
    [ TIMING ] CollTraceUtilsTest : aggregateResultsTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : EventQueueOperationTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : EventQueueMultiThreadTest: 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : getSizeMbTest : 0.00 sec (PASS)
    [ TIMING ] CollTraceUtilsTest : TOTAL : 0.00 sec (PASS)
    [ TIMING ] Total time: 0.00 minutes
  2. Run the RCCL tests.
    Set RCCL_LATENCY_PROFILER=1 when running the RCCL tests; in the NCCL debug files we can see information like the lines below (a sketch of how these fields can be aggregated follows the sample output).
    Please note that in Meta's internal environment we have internal tools (Scuba) that help with aggregating, filtering, and visualization, but we do not have that in the OSS environment, so we output to files. In the code, we leave a placeholder for the Meta integration.

devgpu039:494282:495674 [0] NCCL INFO coll_id 0, percent 11, min_latency_us 88312.796875, max_latency_us 98300.320312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 1, percent 18, min_latency_us 80149.375000, max_latency_us 94813.242188, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 2, percent 18, min_latency_us 68449.570312, max_latency_us 81227.570312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 3, percent 5, min_latency_us 87649.710938, max_latency_us 92216.945312, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 4, percent 14, min_latency_us 88092.070312, max_latency_us 100858.929688, op_name AllReduce, data_type ncclFloat32, count 268435456, message_size_MB 1024.000000, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 5, percent 178, min_latency_us 9818.096680, max_latency_us 27337.013672, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 6, percent 101, min_latency_us 6809.101074, max_latency_us 13728.096680, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 7, percent 152, min_latency_us 5829.539062, max_latency_us 14748.134766, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 8, percent 138, min_latency_us 8587.885742, max_latency_us 20474.212891, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
devgpu039:494282:495674 [0] NCCL INFO coll_id 9, percent 295, min_latency_us 3697.302979, max_latency_us 14604.616211, op_name AllReduce, data_type ncclFloat32, count 16, message_size_MB 0.000061, comm_hash 17664125919619314322
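
For readers without Scuba, the sketch below shows how the min/max/percent fields in the lines above can be reproduced from per-rank latencies after the CPU-level all-gather. The percent formula ((max - min) / min * 100) is an assumption that matches the sample output, not taken from the profiler source; reportSpread is a hypothetical helper.

// Sketch only: aggregate per-rank latencies for one collective into the
// min/max/percent fields seen in the log lines above. The percent formula is
// an assumption inferred from the sample output.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

void reportSpread(uint64_t collId, const std::vector<float>& perRankLatencyUs) {
  const auto mm = std::minmax_element(perRankLatencyUs.begin(),
                                      perRankLatencyUs.end());
  const float minUs = *mm.first;
  const float maxUs = *mm.second;
  const int percent = static_cast<int>((maxUs - minUs) / minUs * 100.0f);
  std::printf("coll_id %llu, percent %d, min_latency_us %f, max_latency_us %f\n",
              static_cast<unsigned long long>(collId), percent, minUs, maxUs);
}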

Additional Documentation:
Shared with the AMD RCCL team privately.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

#include "comm.h"
#include "CollTraceUtils.h"

namespace meta {
Contributor @thananon commented Jul 10, 2025

There is already a namespace facebook in our codebase and this one introduces meta. I am not sure this is a great idea for OSS. @corey-derochie-amd

Is it possible to remove the company-specific namespace and change it to something generic?
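
For illustration, a generic namespace along the lines the comment suggests could look like the sketch below; rccl::latency_profiler is only an example name, not one agreed on in this PR.

// Hypothetical alternative to "namespace meta"; the name is an example only.
#include "comm.h"
#include "CollTraceUtils.h"

namespace rccl {
namespace latency_profiler {
// CollTrace helpers would live here instead of a company-specific namespace.
}  // namespace latency_profiler
}  // namespace rccl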
