Reply Copy Avoidance by alexander-shabanov · Pull Request #1457 · valkey-io/valkey · GitHub

Reply Copy Avoidance #1457

Open · wants to merge 9 commits into base: unstable

Conversation

@alexander-shabanov commented Dec 18, 2024

Overview

This PR introduces the ability to avoid copying the whole content of a string object into replies (i.e. bulk string replies) and to allow I/O threads to refer directly to obj->ptr in the writev iov, as described in #1353.

Key Changes

  • Added the capability to interleave regular replies with copy-avoid replies in client reply buffers during reply construction
  • Extended the write-to-client handlers to support copy-avoid replies
  • Added copy avoidance of bulk string replies when copy avoidance is indicated for I/O threads
  • Minor changes in cluster slot stats in order to support network-bytes-out accounting for copy-avoid replies
  • Copy avoidance is beneficial for performance, regardless of object size, only from a certain number of I/O threads upward, so it is enabled only at or above that thread count. The internal configuration min-io-threads-copy-avoid was introduced to manage this thread count

Note: When copy avoidance is disabled, the content and handling of client reply buffers remain as before this PR.

Implementation Details

client and clientReplyBlock structs:

  1. A buf_encoded flag has been added to the clientReplyBlock struct and to the client struct (for the static c->buf) to indicate whether the reply buffer is in copy avoidance mode (i.e. contains headers and payloads) or not (i.e. plain replies only).
  2. io_last_written_buf, io_last_written_bufpos and io_last_written_data_len fields have been added to the client struct to keep track of write state between writevToClient invocations.
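As a rough illustration of these fields, here is a hypothetical sketch. Only the fields relevant to this discussion are shown, and recordWriteProgress is an illustrative helper, not the PR's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative subset of the client struct described above;
 * all other members are omitted. */
typedef struct client {
    int buf_encoded;                 /* static c->buf in copy-avoid mode? */
    void *io_last_written_buf;       /* buffer last handed to writev */
    size_t io_last_written_bufpos;   /* position reached in that buffer */
    size_t io_last_written_data_len; /* bytes actually written so far */
} client;

/* Record progress after a (possibly partial) writev, so the next
 * writevToClient invocation can resume where this one stopped. */
static void recordWriteProgress(client *c, void *buf, size_t bufpos, size_t written) {
    c->io_last_written_buf = buf;
    c->io_last_written_bufpos = bufpos;
    c->io_last_written_data_len = written;
}
```

The point of tracking this state on the client is that a short writev can stop mid-buffer, and the next invocation must pick up from the same buffer and offset.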

Reply construction:

  1. The original _addReplyToBuffer and _addReplyProtoToList have been renamed to _addReplyPayloadToBuffer and _addReplyPayloadToList and extended to support different types of payloads: regular replies and copy-avoid replies.
  2. The new _addReplyToBuffer and _addReplyProtoToList now call _addReplyPayloadToBuffer and _addReplyPayloadToList and are used for adding regular replies to client reply buffers.
  3. The newly introduced _addBulkOffloadToBuffer and _addBulkOffloadToList are used for adding copy-avoid replies to client reply buffers.
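A minimal sketch of how an encoded buffer can interleave headers and payloads, assuming a simplified payloadHeader with only type and len (the PR's packed header also tracks slot and an actual reply length); appendPayload is an illustrative name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CLIENT_REPLY_PLAIN 0    /* plain RESP bytes */
#define CLIENT_REPLY_BULK_REF 1 /* hypothetical copy-avoid payload type */

/* Simplified header preceding each payload chunk in an encoded buffer. */
typedef struct __attribute__((__packed__)) payloadHeader {
    uint8_t type;
    size_t len; /* payload bytes following this header */
} payloadHeader;

/* Append a payload to an encoded buffer, merging consecutive payloads
 * of the same type under one header (the coalescing described above). */
static size_t appendPayload(char *buf, size_t used, payloadHeader **last,
                            uint8_t type, const void *p, size_t len) {
    if (*last != NULL && (*last)->type == type) {
        (*last)->len += len; /* extend previous header */
    } else {
        payloadHeader *h = (payloadHeader *)(buf + used);
        h->type = type;
        h->len = len;
        *last = h;
        used += sizeof(payloadHeader);
    }
    memcpy(buf + used, p, len);
    return used + len;
}
```

Merging consecutive same-type payloads keeps the header overhead bounded when many plain replies land back-to-back in the same buffer.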

Write-to-client infrastructure:

The writevToClient and _postWriteToClient functions have been significantly changed to support the copy avoidance capability.
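The core of the writev-based flow can be sketched as follows. buildBulkIov is a hypothetical helper showing how an iov can frame one RESP bulk reply while pointing the middle entry straight at the value bytes (obj->ptr in the real code) instead of copying them:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/* Frame one RESP bulk reply across three iov entries so writev() can
 * send it without copying the value; only the small RESP framing
 * ("$<len>\r\n" and the trailing "\r\n") touches scratch memory. */
static int buildBulkIov(struct iovec *iov, char *hdr, size_t hdrcap,
                        const char *value, size_t vlen) {
    int n = snprintf(hdr, hdrcap, "$%zu\r\n", vlen);
    iov[0].iov_base = hdr;
    iov[0].iov_len = (size_t)n;
    iov[1].iov_base = (void *)value; /* no copy: direct reference */
    iov[1].iov_len = vlen;
    iov[2].iov_base = "\r\n";
    iov[2].iov_len = 2;
    return 3; /* iovcnt to pass to writev(fd, iov, 3) */
}
```

This also shows why partial-write bookkeeping matters: a short writev may land inside any of the three segments, and the resume logic must account for that.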

Internal configuration:

  1. min-io-threads-avoid-copy-reply - Minimum number of IO threads for copy avoidance
  2. min-string-size-avoid-copy-reply - Minimum bulk string size for copy avoidance when IO threads disabled
  3. min-string-size-avoid-copy-reply-threaded - Minimum bulk string size for copy avoidance when IO threads enabled
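One plausible reading of how these three knobs could combine into a single decision predicate, using the defaults visible in this PR's config.c snippet (7 threads, 16 KiB, 64 KiB). This is a sketch; the PR's actual logic may differ:

```c
#include <assert.h>
#include <stddef.h>

/* Internal config defaults taken from this PR's config.c snippet. */
static int min_io_threads_copy_avoid = 7;
static size_t min_string_size_copy_avoid = 16384;          /* no I/O threads */
static size_t min_string_size_copy_avoid_threaded = 65536; /* I/O threads on */

/* Hypothetical predicate: with enough I/O threads, copy avoidance is
 * always on; otherwise it kicks in only for large enough bulk strings. */
static int shouldAvoidCopy(int io_threads_enabled, int io_threads, size_t bulk_len) {
    if (!io_threads_enabled) return bulk_len >= min_string_size_copy_avoid;
    if (io_threads >= min_io_threads_copy_avoid) return 1;
    return bulk_len >= min_string_size_copy_avoid_threaded;
}
```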

Testing

  1. Existing unit and integration tests pass. Copy avoidance is enabled in tests run with the --io-threads flag
  2. Added unit tests for copy avoidance functionality

Performance Tests

Note: an io-threads 1 config means only the main thread with no additional I/O threads; io-threads 2 means the main thread plus 1 I/O thread; io-threads 9 means the main thread plus 8 I/O threads.

512 byte object size

Tests are conducted on memory optimized instances using:

  • 3,000,000 keys
  • 512 bytes object size
  • 1000 clients
io-threads (incl. main thread) | Plain Reply | Copy Avoidance
7                              | 1,160,000   | 1,160,000
8                              | 1,150,000   | 1,280,000
9                              | 1,150,000   | 1,330,000
10                             | N/A         | 1,380,000
11                             | N/A         | 1,420,000

Various object size, small number of threads

io-threads | Data size | Keys      | Clients | Instance type     | Unstable branch | Copy Avoidance On
1          | 512 byte  | 3,000,000 | 1,000   | memory optimized  | 195,000         | 195,000
2          | 512 byte  | 3,000,000 | 1,000   | memory optimized  | 245,000         | 245,000
3          | 512 byte  | 3,000,000 | 1,000   | memory optimized  | 455,000         | 459,000
4          | 512 byte  | 3,000,000 | 1,000   | memory optimized  | 685,000         | 685,000
1          | 1K        | 3,000,000 | 1,000   | memory optimized  | 185,000         | 185,000
2          | 1K        | 3,000,000 | 1,000   | memory optimized  | 235,000         | 235,000
3          | 1K        | 3,000,000 | 1,000   | memory optimized  | 450,000         | 450,000
1          | 4K        | 1,000,000 | 1,000   | network optimized | 182,000         | 187,000
2          | 4K        | 1,000,000 | 1,000   | network optimized | 240,000         | 238,000
1          | 16K       | 1,000,000 | 500     | network optimized | 100,000         | 120,000
2          | 16K       | 1,000,000 | 500     | network optimized | 140,000         | 140,000
3          | 16K       | 1,000,000 | 500     | network optimized | 275,000         | 260,000
1          | 32K       | 500,000   | 500     | network optimized | 57,000          | 90,000
2          | 32K       | 500,000   | 500     | network optimized | 110,000         | 110,000
3          | 32K       | 500,000   | 500     | network optimized | 215,000         | 215,000
1          | 64K       | 100,000   | 500     | network optimized | 30,000          | 57,000
2          | 64K       | 100,000   | 500     | network optimized | 69,000          | 61,000
3          | 64K       | 100,000   | 500     | network optimized | 120,000         | 120,000
4          | 64K       | 100,000   | 500     | network optimized | 115,000–175,000 | 175,000
5          | 64K       | 100,000   | 500     | network optimized | 115,000–165,000 | 230,000

@madolson (Member) left a comment

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

@zuiderkwast (Contributor)

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as: if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

@alexander-shabanov (Author)

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

Fixed

@alexander-shabanov alexander-shabanov force-pushed the reply_offload branch 2 times, most recently from ac7e1f5 to a40e72e Compare December 19, 2024 14:03
@alexander-shabanov (Author) commented Dec 20, 2024

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

From the tests and perf profiling, it appears that the main cause of the performance improvement from this feature is eliminating the expensive memory access to obj->ptr by the main thread, and much less the elimination of the copy into reply buffers. Without I/O threads, the main thread still needs to access obj->ptr, and the writev flow is a bit slower (it requires additional preparation work) than the plain write flow. I will publish results of various tests with and without I/O threads and with different data sizes next week.

@alexander-shabanov (Author) commented Dec 20, 2024

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as: if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes in practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes next week. IMPORTANT NOTE: we can't switch reply offload on or off dynamically according to object (string) size, because the main optimization is eliminating the expensive memory access to obj->ptr by the main thread (eliminating the copy is much less important).

@zuiderkwast (Contributor)

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

@alexander-shabanov (Author)

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

Published results of performance tests in the PR description

@zuiderkwast (Contributor)

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

@alexander-shabanov (Author)

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Note that 9 threads means the main thread + 8 I/O threads. Why do we need to find a threshold? I still think it should be a config param, and customers should test their workloads and enable it or not accordingly.

@ranshid (Member) commented Jan 2, 2025

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users almost never tune their caches at this level, and it is also very problematic to tell users to learn their workload and formulate their own rules for when to enable this config. I also think there is some risk in introducing this degradation, so we should work to understand what other alternatives we have.
I can think of some:

  1. Enable this feature only when the number of active IO-Threads is 8
  2. Track the CPU consumption on the engine thread and dynamically enable the feature when the main engine is utilizing high CPU
  3. Maybe we can find a way to tag that the reply object should be offloaded so that we will not get the memory access penalty
     And I am sure there are more.

src/io_threads.c Outdated
c->io_last_bufpos = ((clientReplyBlock *)listNodeValue(c->io_last_reply_block))->used;
clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block);
c->io_last_bufpos = block->used;
/* If reply offload enabled force new header */
Contributor:

Maybe we should indeed check if reply offload is enabled before writing NULL to the header? Checking a global is cheaper than write access to the block

Author:
Anyway, with or without reply offload, the main thread has just finished writing to this block.

src/networking.c Outdated
}

static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) {
if (last_header->type == type && last_header->slot == slot) {
Contributor:
When using, for example, MGET with small values that are not offloaded and come from different slots, wouldn't this cause multiple plain headers, thus requiring writev instead of a simple write? We should investigate whether this causes any performance degradation.

Author:
In CMD (cluster mode disabled), slot will always be equal to -1 (see if (!clusterSlotStatsEnabled(slot)) slot = -1; in upsertPayloadHeader).
In CME (cluster mode enabled), MGET keys must map to the same slot.

Member:
It does mean that some plans to support cross-slot operations in CME would have to consider this as well.
I agree we can consider this solved when the time comes; per-slot accounting will probably have to be refactored a bit in order to support cross-slot commands.

@zuiderkwast (Contributor)

Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons for that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.

  1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.
  2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.
  3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.

Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.

@alexander-shabanov (Author)

Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons for that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.

1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.

2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.

3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.

Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.

@zuiderkwast We discussed it and are going to propose in this PR to perform reply offload according to the static number of I/O threads (i.e. the io-threads config). The static number of I/O threads is preferable to the number of active I/O threads because it makes all tests and troubleshooting of potential issues much more deterministic. I am working on the final code changes and tests.

Reply offload activation according to object size is deferred to another PR.

The best solution would be to perform reply offload according to the actual main thread load, matching the condition where reply offload is beneficial. However, that requires much more complex research.

@alexander-shabanov (Author)

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users almost never tune their caches at this level, and it is also very problematic to tell users to learn their workload and formulate their own rules for when to enable this config. I also think there is some risk in introducing this degradation, so we should work to understand what other alternatives we have. I can think of some:

1. Enable this feature only when the number of active IO-Threads is 8

2. Track the CPU consumption on the engine thread and dynamically enable the feature when the main engine is utilizing high CPU

3. Maybe we can find a way to tag that the reply object should be offloaded so that we will not get the memory access penalty
   And I am sure there are more.

@ranshid please see this reply

@ranshid (Member) commented Jan 13, 2025

Thank you @alexander-shabanov. I think we need to be careful about this PR as it is near the release date. I think it is good that we will focus on introducing a simple optimization first and later focus on better optimization for dynamic support of large strings when io-threads are disabled.

@ranshid (Member) commented Jan 13, 2025

@alexander-shabanov / @uriyage following the discussion with the maintainers, let's try to answer some of the follow-up questions:

  1. In the performance results we see degradation when using small values, which is "mitigated" when the number of io-threads is increased, and we get better results from offloading the reply when the number of io-threads is higher than N. I suppose this is explained by the io-threads not prefetching the values: when the number of io-threads is small they are the bottleneck and we get degraded performance, while when the number of io-threads is high the bottleneck shifts back to the engine. Can you please verify that?

  2. It does seem the better solution is to make a dynamic decision based on the string size. IMO, in order to better support it we would need to use the alternative solution (tagging the clientReplyBlock). I recall there were some concerns about going with this option, but I would be happy if you could revive some benchmark results and bring more data.

@alexander-shabanov alexander-shabanov force-pushed the reply_offload branch 4 times, most recently from e5342fc to b9173ee Compare January 30, 2025 15:16
@alexander-shabanov alexander-shabanov changed the title from Reply offload to Reply Copy Avoidance Jan 30, 2025
@zuiderkwast (Contributor) left a comment

Partial review. I can't continue right now, so I'm posting my pending comments. I've reviewed up to the middle of networking.c, just before the iovec definitions.

I'm convinced this feature is worth the complexity.

src/networking.c Outdated
return _addReplyPayloadToBuffer(c, s, len, CLIENT_REPLY_PLAIN);
}

static size_t _addBulkOffloadToBuffer(client *c, const void *payload, size_t len) {
Contributor:

Offload or CopyAvoid?

Let's think about the terminology for a bit. Maybe we can find some terms that are even more explicit about what the buffer contains.

We have three types of buffer content, and we should ideally use different words for each of them:

  • unencoded buffer with PLAIN content, no payload headers
  • encoded buffer, payload header type PLAIN, plain content
  • encoded buffer, payload header type COPY-AVOID, content is a sequence of bulkCopyAvoid objects.

The overloaded use of the word PLAIN is a source of confusion here. I suggest we use PLAIN for unencoded buffer and another word such as INLINE for the payload header type for encoded buffer with plain content.

The bulkCopyAvoid struct is actually a reference to the object. That's why we increment the reference counter. 😀 So maybe we should use the word "reference" for this kind of buffer content, e.g. _addBulkRefToBuffer, and the payload header type can be CLIENT_REPLY_REF? Or use the word INDIRECT?

Please also add a short comment above each of these functions to explain what the function name itself doesn't explain, like what it means that if it returns 0, etc.

And one more thing. I think this function can take const bulkCopyAvoid * instead of void *payload and size_t len. Where this function calls _addReplyPayloadToBuffer we can spell out sizeof(bulkCopyAvoid) for the length argument.
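A tiny sketch of the reference-counting idea described in this comment, with illustrative names (robj here is a stand-in, and bulkRef/makeBulkRef are not the PR's identifiers): the cob stores a small struct pointing at the object instead of its bytes, and takes a reference so the object stays alive until the write completes.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the server's reference-counted object. */
typedef struct robj {
    int refcount;
    const char *ptr;
    size_t len;
} robj;

/* The "reference" payload stored in the reply buffer. */
typedef struct bulkRef {
    robj *obj;
} bulkRef;

static void incrRefCount(robj *o) { o->refcount++; }
static void decrRefCount(robj *o) { o->refcount--; }

/* Store a reference in the reply instead of copying obj->ptr. */
static bulkRef makeBulkRef(robj *o) {
    incrRefCount(o); /* released once writev has sent the value */
    return (bulkRef){.obj = o};
}
```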

Author:

Streamlined terminology in the code according to the suggestion

src/networking.c Outdated
@@ -1127,8 +1277,18 @@ void addReplyBulkLen(client *c, robj *obj) {
_addReplyLongLongWithPrefix(c, len, '$');
}

int tryOffloadBulkReply(client *c, robj *obj) {
Contributor:

Is this used only in this file? Make it static. (It shows the intent that it's a local function, and it lets the compiler warn about unused functions.)

Rename "Offload"?

Author:

Added static and renamed

/* Adds the reply to the reply linked list.
* Note: some edits to this function need to be relayed to AddReplyFromClient. */
void _addReplyProtoToList(client *c, list *reply_list, const char *s, size_t len) {
static void _addReplyPayloadToList(client *c, list *reply_list, const char *payload, size_t len, uint8_t payload_type) {
Contributor:

The comment above belongs to _addReplyProtoToList I think...?

Please move it and write a more correct comment about _addReplyPayloadToList.

Author:

The comment belongs to the original function, which became _addReplyPayloadToList. Updated the comment to be more relevant.

src/networking.c Outdated
@@ -484,6 +621,18 @@ void _addReplyToBufferOrList(client *c, const char *s, size_t len) {
if (len > reply_len) _addReplyProtoToList(c, c->reply, s + reply_len, len - reply_len);
}

void _addBulkOffloadToBufferOrList(client *c, robj *obj) {
Contributor:

Rename?

Add a short one-line comment as documentation.

Make it static. It's what we've decided to use for local functions. (Keep the underscore prefix if you want. I don't care. It's used in this file but not in other files.)

Author:

Renamed and added comment

zuiderkwast
src/networking.c Outdated
* Each chunk contains header followed by payload */
typedef struct __attribute__((__packed__)) payloadHeader {
size_t len; /* payload length in a reply buffer */
size_t actual_len; /* actual reply length for non-plain payloads */
Contributor:

Consider explicitly stating in the comment that this represents the bulk string length plus the RESP protocol overhead, and maybe rename it to actual_reply_len.

Author:

I wanted to make it more generic for future copy avoidance use cases, if any. What do you think? Should we make it very specific now?

Contributor:

We can always change it in the future.

Member:

Agree with Uri.
Also maybe change it to:
size_t reply_encoded_len;
size_t reply_len;
?

Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
@ranshid (Member) left a comment

I only got to the point of the writev changes, so this is very incomplete. Pushing my comments up till here.

src/config.c Outdated
Comment on lines 3259 to 3262
createIntConfig("min-io-threads-avoid-copy-reply", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_io_threads_copy_avoid, 7, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-io-threads-value-prefetch-off", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_io_threads_value_prefetch_off, 10, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-string-size-avoid-copy-reply", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_string_size_copy_avoid, 16384, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-string-size-avoid-copy-reply-threaded", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_string_size_copy_avoid_threaded, 65536, INTEGER_CONFIG, NULL, NULL),
Member:

Since these are internal, can we add a comment explaining what each of these controls?

Author

They are commented thoroughly inside the server struct, same as most config parameters.

src/server.h Outdated
/* This structure is used in order to represent the output buffer of a client,
* which is actually a linked list of blocks like that, that is: client->reply. */
typedef struct clientReplyBlock {
size_t size, used;
payloadHeader *last_header;
union {
uint8_t raw_flag;
Member

Are we really using this raw_flag to zero out the flags, or did we just add it as dead code for future needs?
If we do use it, I would suggest using it when allocating a new buffer; if not, we can just drop it for now.

Author

Dropped

src/server.h Outdated
/* This structure is used in order to represent the output buffer of a client,
* which is actually a linked list of blocks like that, that is: client->reply. */
typedef struct clientReplyBlock {
size_t size, used;
payloadHeader *last_header;
Member

Can we document this field? Its use, and the fact that it is expected to be NULL when we are not encoded.
Also, I find it confusing that we use the same name in the client and the replyBlock. I understand the client currently has 2 ways to operate (with the obuf and the reply list), but maybe consider naming them accordingly, i.e. in the client:
buf_last_header
in the replyBlock:
block_last_header

Author

  1. Added /* points to a last header in an encoded buffer */ comment to last_header. More detailed/specific comments are located in the code according to usage.
  2. Prefer to leave the last_header name the same, as we have a buf field in both the client and clientReplyBlock structs.

src/networking.c Outdated
* Each chunk contains header followed by payload */
typedef struct __attribute__((__packed__)) payloadHeader {
size_t len; /* payload length in a reply buffer */
size_t actual_len; /* actual reply length for non-plain payloads */
Member

Agree with Uri.
Also maybe change it to:
size_t reply_encoded_len;
size_t reply_len;
?

@@ -119,6 +120,10 @@ static void prefetchEntry(KeyPrefetchInfo *info) {
if (hashtableIncrementalFindStep(&info->hashtab_state) == 1) {
/* Not done yet */
moveToNextKey();
} else if (server.io_threads_num >= server.min_io_threads_value_prefetch_off) {
Member

IMO the use of another configuration just to have finer-grained control over this logic is not needed at this stage. I suggest using min-io-threads-avoid-copy-reply and removing this config for now.

Author

Applied suggestion

src/networking.c Outdated
} payloadHeader;

/* To avoid copy of whole string in reply buffer
* we store store pointers to object and string itself */
Member

Suggested change
* we store store pointers to object and string itself */
* we store pointers to object and string itself */

src/networking.c Outdated
Comment on lines 196 to 204
size_t len = sdslen(obj->ptr);

/* Main thread only. No I/O threads */
if (server.io_threads_num == 1) {
/* Copy avoidance is preferred starting certain string size */
return server.min_string_size_copy_avoid && len >= (size_t)server.min_string_size_copy_avoid;
}
/* Main thread + I/O threads */
return server.min_string_size_copy_avoid_threaded && len >= (size_t)server.min_string_size_copy_avoid_threaded;
Member

One reason I can think of for someone choosing to disable the length check is to avoid accessing the string.
However, in your case that is not possible, since you first access the string (sdslen) and only then decide whether to check it. I suggest making the min_string_size_copy_avoid and min_string_size_copy_avoid_threaded enable checks happen before we even access the string.

Author

If copy avoidance is not selected by the number of threads (i.e. server.io_threads_num >= server.min_io_threads_copy_avoid), the code will access the string sooner or later anyway. However, applying the suggestion to sharpen correctness.

return _addReplyPayloadToBuffer(c, s, len, PLAIN_REPLY);
}

/* Adds bulk string reference (i.e. pointer to object and pointer to string itself) to static buffer
Member

Suggested change
/* Adds bulk string reference (i.e. pointer to object and pointer to string itself) to static buffer
/* Adds bulk string reference (i.e. pointer to object and pointer to string itself) to client buffer

Author

The static buffer terminology comes from the genuine (original) code. For example, see https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L380
Let me know if client buffer is the preferred terminology and I will be glad to change it to something accepted/understandable by the majority.

Member

static is just wrong. I know there was historical reasoning behind it, but no need to keep bad habits.

Member
@ranshid ranshid left a comment

Some more comments. still did not review the tests

/* Auxiliary fields for scattering BUFSTR_REF chunks from encoded buffers */
int prfxcnt; /* number of prefixes */
char (*prefixes)[BULK_STR_LEN_PREFIX_MAX_SIZE]; /* bulk string prefixes */
char *crlf; /* bulk string suffix */
Member

Probably redundant. crlf is... crlf; we can probably keep a constant.

Author

You are right, it can be constant. I tried to use static const char crlf[2] = {'\r', '\n'}; instead of reply->crlf in addPlainBufferToReplyIOV(reply->crlf, 2, reply, metadata);
Surprisingly, it lowers performance a little bit, maybe due to a memory access difference.
Leaving crlf as-is.

Member

How much was the performance impacted?

Member

What was the result of this? I'm also surprised a constant was slower than referencing a variable.

Comment on lines +2307 to +2312
/* RESP encodes bulk strings as $<length>\r\n<data>\r\n */
char *prefix = reply->prefixes[reply->prfxcnt];
prefix[0] = '$';
size_t num_len = ll2string(prefix + 1, sizeof(reply->prefixes[0]) - 3, str_len);
prefix[num_len + 1] = '\r';
prefix[num_len + 2] = '\n';
Member

Note: it is possible to consider using shared.bulkhdr here in case the string len is smaller than OBJ_SHARED_BULKHDR_LEN. I guess this code is complex enough already, but maybe add a TODO here?

Author

Let's not complicate it without a real need (benefit?)

Member

At least consider making use of the passed crlf here. IMO it just looks messy.

src/networking.c Outdated
c->io_last_written_data_len = (size_t)(metadata[last].data_len + remaining);
}

void proceedToUnwritten(replyIOV *reply, int nwritten) {
Member

Let's place a comment explaining this function and where it will be used.

Author

Ironically, the name of this function comes from a comment in the original code that has been refactored.
Adding a comment to explain its purpose.

Contributor

Yeah, when it's moved out to a separate function, the local context is lost, so some more context in the comment could be useful. When the code was within writevToClient, you read it from top to bottom, then I guess the short comment was enough.

src/server.h Outdated
Comment on lines 1225 to 1228
char *io_last_written_buf; /* Last buffer that has been written to the client connection
* Last buffer is either c->buf or c->reply list node (i.e. buf from a clientReplyBlock) */
size_t io_last_written_bufpos; /* The buffer has been written until this position */
size_t io_last_written_data_len; /* The actual length of the data written from this buffer
Member

Since these are all being tracked together, I suggest packing them into a dedicated struct, e.g. struct WrittenBuf last_written_buf or something.

Author

Applied suggestion

src/networking.c Outdated
@@ -2195,44 +2542,100 @@ int _writeToClient(client *c) {
}

c->nwritten = tot_written;
if (tot_written > 0) {
c->io_last_written_buf = c->buf;
Member

Question: this is saved here, but c->buf can be relocated by clientsCronResizeOutputBuffer. Is that handled somehow?

Author

Added protection to clientsCronResizeOutputBuffer

@@ -889,6 +889,8 @@ int primaryTryPartialResynchronization(client *c, long long psync_offset) {
* 4) Send the backlog data (from the offset to the end) to the replica. */
waitForClientIO(c);
c->flag.replica = 1;
serverAssert(c->bufpos == 0);
Member

Let's comment on why this assertion is true and what it is here to ensure.

Author
@alexander-shabanov alexander-shabanov Feb 24, 2025

It is redundant after the reply_offload flag has been removed from the client level. I removed this assert and c->flag.buf_encoded = 0; from replication.c.

@@ -1155,6 +1157,8 @@ void syncCommand(client *c) {
if (server.repl_disable_tcp_nodelay) connDisableTcpNoDelay(c->conn); /* Non critical if it fails. */
c->repl_data->repldbfd = -1;
c->flag.replica = 1;
serverAssert(c->bufpos == 0);
Member

Let's comment on why this assertion is true and what it is here to ensure.

Author

It is redundant after the reply_offload flag has been removed from the client level. I removed this assert and c->flag.buf_encoded = 0; from replication.c.

Member
@ranshid ranshid left a comment

some minor comments about the tests

@zuiderkwast zuiderkwast changed the title Copy Avoidance Reply Copy Avoidance Feb 24, 2025
Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Member
@ranshid ranshid left a comment

O.K generally LGTM

Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>

# Restore min-io-threads-avoid-copy-reply value
r config set min-io-threads-avoid-copy-reply $copy_avoid
} {OK} {needs:debug}
Member

Suggested change
} {OK} {needs:debug}
} {OK}

These don't look like they need debug, maybe an earlier iteration needed it?

int original_argc; /* Num of arguments of original command if arguments were rewritten. */
robj **original_argv; /* Arguments of original command if arguments were rewritten. */
size_t bufpos;
payloadHeader *last_header; /* Pointer to the last header in a buffer in reply offload mode */
Member

Suggested change
payloadHeader *last_header; /* Pointer to the last header in a buffer in reply offload mode */
payloadHeader *last_header; /* Pointer to the last header in a buffer when using copy avoidance */

presumably?

Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
5 participants