Optimize slice handling to accelerate the large batch transfer operation #557

SCDESPERTATE · 2025-06-25T14:38:37Z

Motivation
In scenarios such as those described in add support for batch transfer to accelerate transfer operation #499, a transfer involves a large batch (thousands of requests or more) while each chunk in the batch is relatively small (tens to hundreds of KiB), so a substantial number of slices and work requests must be generated. The current implementation introduces non-negligible latency because of the following issues:
- In RdmaTransport::submitTransferTask, all slices must be allocated before posting to RdmaContext::submitPostSend. When there are many requests, the volume of slices can overwhelm the ThreadLocalSliceCache, causing page faults and delaying transfer initiation.
- The TransferRequest length may not be a multiple of GlobalConfig::slice_size, leading to smaller final slices. Each slice becomes a separate work request, and when many small slices are created, overhead increases, reducing throughput.
Modification
- Submit the slice_list when the number of slices allocated by the new operator reaches a predefined watermark, even if there are still pending requests to be processed in RdmaTransport::submitTransferTask.
- Merge the final slice with the previous slice if its size is below a specified threshold.
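
The two changes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `SliceSketch`, `splitRequest`, `submitWithWatermark`, `merge_threshold`, and `watermark` are hypothetical names, and the "post" step is reduced to recording how many slices each flush would carry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a slice; the real slice type carries more state.
struct SliceSketch {
    uint64_t offset;
    uint64_t length;
};

// Split a request of `length` bytes into slices of `slice_size` bytes.
// A short tail (below `merge_threshold`) is folded into the previous slice
// instead of becoming its own work request.
std::vector<SliceSketch> splitRequest(uint64_t length, uint64_t slice_size,
                                      uint64_t merge_threshold) {
    assert(slice_size > 0);
    std::vector<SliceSketch> slices;
    for (uint64_t offset = 0; offset < length; offset += slice_size) {
        uint64_t remaining = length - offset;
        uint64_t len = remaining < slice_size ? remaining : slice_size;
        bool is_short_tail = len < slice_size && len < merge_threshold;
        if (is_short_tail && !slices.empty()) {
            slices.back().length += len;  // enlarge the previous slice
        } else {
            slices.push_back({offset, len});
        }
    }
    return slices;
}

// Walk a batch of request lengths and flush the pending slice list as soon
// as it reaches `watermark`, rather than waiting for the whole batch.
// Returns the number of slices carried by each flush.
std::vector<size_t> submitWithWatermark(
    const std::vector<uint64_t>& request_lengths, uint64_t slice_size,
    uint64_t merge_threshold, size_t watermark) {
    std::vector<size_t> flushes;
    std::vector<SliceSketch> pending;
    for (uint64_t length : request_lengths) {
        for (const SliceSketch& s :
             splitRequest(length, slice_size, merge_threshold)) {
            pending.push_back(s);
            if (pending.size() >= watermark) {
                flushes.push_back(pending.size());  // stand-in for posting
                pending.clear();
            }
        }
    }
    if (!pending.empty()) flushes.push_back(pending.size());
    return flushes;
}
```

With a 64 KiB slice size and a 32 KiB merge threshold, a 150 KiB request yields two slices (64 KiB and 86 KiB) instead of three, and a batch of five such requests with a watermark of 4 is posted in flushes of 4, 4, and 2 slices.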
Result
Run the Python script provided by add support for batch transfer to accelerate transfer operation #499.
Since the modification is confined to RdmaTransport::submitTransferTask, the results of two runs (without and with the optimization) have been merged for comparison as follows:

============================================================================================
SUMMARY
============================================================================================
Test Case            wo-opt(s)      wo-opt(GB/s)   w-opt(s)       w-opt(GB/s)    Speedup   
--------------------------------------------------------------------------------------------
200MB/5000chunks     0.008          27.921         0.007          32.471         16.29%    
200MB/10000chunks    0.011          19.145         0.008          24.819         29.63%    
300MB/8000chunks     0.012          26.849         0.010          32.463         20.90%    
400MB/10000chunks    0.015          27.521         0.013          33.013         19.95%    
500MB/15000chunks    0.021          25.069         0.017          30.392         21.23%    
600MB/12000chunks    0.021          30.130         0.018          34.885         15.78%    
700MB/20000chunks    0.029          25.571         0.024          31.191         21.97%    
700MB/10000chunks    0.025          29.294         0.019          38.352         30.91%    

Average Speedup: 22.08%
Maximum Speedup: 30.91%
Average w-opt Throughput: 32.198 GB/s
Average wo-opt Throughput: 26.438 GB/s

The results show that the modification improves throughput by roughly 16% to 31%, about 22% on average.
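
The Speedup column appears consistent with the ratio of the two throughput columns. The sketch below shows that arithmetic; the function names are hypothetical, and the comparison uses a tolerance because the table values are rounded.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain in percent: (w-opt / wo-opt - 1) * 100.
double speedupPercent(double wo_opt_gbps, double w_opt_gbps) {
    assert(wo_opt_gbps > 0.0);
    return (w_opt_gbps / wo_opt_gbps - 1.0) * 100.0;
}

// The table prints rounded figures, so compare with a tolerance.
bool matchesReported(double computed, double reported, double tolerance) {
    return std::fabs(computed - reported) <= tolerance;
}
```

For the 200MB/5000chunks row, `speedupPercent(27.921, 32.471)` gives about 16.3%, in line with the reported 16.29%.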

alogfans

Let @doujiang24 double-confirm it. LGTM.

SCDESPERTATE and others added 3 commits June 24, 2025 12:24

kick off transfer first when there are too many slices to post in `RdmaTransport::submitTransferTask`

440518b

allow the last slice of a TransferRequest to be larger to reduce wr&&slice related overhead

f1e38f4

Merge branch 'kvcache-ai:main' into main

1a94176

SCDESPERTATE changed the title ~~Optimize to accelerate the large batch transfer operation~~ Optimize slice handling to accelerate the large batch transfer operation Jun 25, 2025

SCDESPERTATE mentioned this pull request Jun 26, 2025

add support for asynchronous batch transfer to accelerate transfer operation #564

Open

add configs && docs explanation

dcc03a9

SCDESPERTATE marked this pull request as ready for review June 26, 2025 15:21

alogfans approved these changes Jun 27, 2025
