Apply transaction batches in periodic intervals. by mtrippled · Pull Request #4504 · XRPLF/rippled · GitHub

Apply transaction batches in periodic intervals. #4504


Merged
intelliot merged 2 commits into XRPLF:develop on Sep 11, 2023

Conversation

mtrippled
Collaborator
@mtrippled mtrippled commented Apr 21, 2023

Add new transaction submission API field, "sync_mode", which determines the behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately, and return only once the transaction has been processed.
2) async: Put the transaction into the batch for the next processing interval and return immediately.
3) wait: Put the transaction into the batch for the next processing interval and return only after it is processed.

This PR is related to 2 others that, when combined, increase throughput significantly:
#4503
#4505

High Level Overview of Change

This improves transaction throughput. For background, transactions are applied to the open ledger in batches. Only one batch is applied at a time. As they are received from either a client or a peer, transactions are added to a batch. If no batches are being processed, then the current batch is processed immediately. Otherwise, it will be processed once the current batch completes. Batches are applied continuously until no more transactions are queued this way. This pattern optimizes both throughput and latency, but only if applying batches does not contend with other activities.
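
A minimal sketch of this coalescing-batch pattern is below. It is illustrative only; the names and structure are simplified stand-ins, not rippled's actual NetworkOPs code:

```cpp
// Illustrative sketch of the existing coalescing-batch pattern described
// above. Names and structure are simplified; not rippled's actual code.
#include <mutex>
#include <utility>
#include <vector>

struct Tx {};  // stand-in for a submitted transaction

class BatchApplier
{
public:
    // Called for each transaction received from a client or a peer.
    void submit(Tx tx)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(tx));
            if (applying_)
                return;  // the batch in flight will be followed by ours
            applying_ = true;
        }

        // Drain batches until nothing remains queued.
        for (;;)
        {
            std::vector<Tx> batch;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (pending_.empty())
                {
                    applying_ = false;
                    return;
                }
                batch.swap(pending_);
            }
            applyBatch(batch);  // each call pays the per-batch setup cost
        }
    }

private:
    // Applies one batch to the open ledger under the master/ledger locks.
    void applyBatch(std::vector<Tx> const& batch)
    {
        (void)batch;  // placeholder: apply each transaction here
    }

    std::mutex mutex_;
    std::vector<Tx> pending_;
    bool applying_ = false;
};
```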

However, batch application contends with numerous other activities for the MasterLock and the LedgerMaster lock, and as transaction volume increases, so does lock contention. Analysis under heavy transaction load shows that a large amount of time is spent in each transaction batch setting up the open ledger for modification, while each individual transaction takes a very small amount of time. More importantly, the time spent preparing to modify the ledger is not affected by the size of the batch. Instead, it is related to the number of transactions in the current open ledger--as transaction volume increases, so does the time it takes to apply each batch.

For example, assume it takes 5ms to prepare the ledger for each batch, and 50us per transaction. Minimizing wall clock time, and therefore lock contention, means minimizing the number of batches. To put this in perspective, single transactions submitted just under 5ms apart would consume all available wall clock time, with the vast majority spent simply preparing the open ledger for modification! That's not a problem if our only workload is applying transactions, but other activities need those locks as well.

The solution implemented here instead applies batches at periodic intervals of approximately 100ms, or 10 times per second. Contrasted with the example above, reducing the number of batches this way cuts open ledger preparation time from nearly a full second to only 50ms, while actual transaction processing is a trivial 10ms. That's a 94% reduction in lock contention! It's a contrived example, of course, but it plays out in testing--as transaction volume increases, lock contention drops and the server is able to process significantly higher volume.
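
For reference, the arithmetic behind that example, with the submission rate (roughly 200 transactions per second) assumed purely for illustration:

```cpp
// Back-of-the-envelope arithmetic for the example above, assuming roughly
// 200 transactions per second (an illustrative rate only).
#include <cstdio>

int main()
{
    constexpr double prepMs = 5.0;      // per-batch open-ledger setup
    constexpr double txMs = 0.05;       // 50us of actual work per transaction
    constexpr double txPerSec = 200.0;  // assumed submission rate

    // One batch per transaction: ~200 batches/s, almost all of it setup.
    double const perTx = txPerSec * prepMs + txPerSec * txMs;  // ~1010 ms/s

    // Periodic batches every 100ms: only 10 batches/s.
    double const periodic = 10.0 * prepMs + txPerSec * txMs;   // 50 + 10 = 60 ms/s

    std::printf(
        "per-transaction batches: %.0f ms/s, periodic batches: %.0f ms/s "
        "(%.0f%% less time under the locks)\n",
        perTx,
        periodic,
        100.0 * (1.0 - periodic / perTx));
}
```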

Along with this fix is an enhancement to the submission API. Namely, the existing behavior is to immediately process transactions as they are submitted by a client (but not by a peer). However, this tends to diminish the effectiveness of the fix as the volume submitted directly to the server increases, because more batches are being applied. The problem exhibits itself only under very high volume, but if not addressed it will cause problems as livenet volume increases. The API enhancement creates an optional new field called "sync_mode" with the following possible settings:
1) sync (default): Process transactions in a batch immediately, and return only once the transaction has been processed.
2) async: Put the transaction into the batch for the next processing interval and return immediately. If successful, return a new code: terSUBMITTED.
3) wait: Put the transaction into the batch for the next processing interval and return only after it is processed.

Trade-offs for each option are as follows:

  • sync: This is identical to the existing behavior, so no users will be surprised by this as long as it stays the default. However, under high submission volume it will lead to the server instability described above.
  • async: This returns immediately, and is actually faster for the client. However, transaction submission errors that occur only once the transaction is applied will be unknown to the client. This should be used for performance testing.
  • wait: This is the slowest for the client, but it allows the server to keep up with the network as if the transaction had been submitted "async".
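
For illustration only, here is a hedged sketch of how a submit handler might interpret the new field. The enum, function name, and the behaviors noted in the comments are assumptions for this example, not rippled's actual API:

```cpp
// Hedged sketch: map the proposed "sync_mode" string to a behavior.
// All names here are made up for the example.
#include <string>

enum class SyncMode { sync, async, wait };

SyncMode
parseSyncMode(std::string const& value)
{
    if (value == "async")
        return SyncMode::async;  // enqueue for the next 100ms batch,
                                 // return terSUBMITTED immediately
    if (value == "wait")
        return SyncMode::wait;   // enqueue for the next 100ms batch,
                                 // block until that batch is applied
    return SyncMode::sync;       // default: apply a batch immediately and
                                 // return once the transaction is processed
}
```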

Context of Change

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)

@ximinez
Collaborator
ximinez commented Apr 24, 2023

Roughly how much slower is it for the client if they submit with wait?

@mtrippled
Collaborator Author

Roughly how much slower is it for the client if they submit with wait?

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

@ximinez
Collaborator
ximinez commented Apr 25, 2023

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

Yeah, duh. I should have realized that. But anyway, the reason I ask is that I suspect most clients aren't going to notice an extra 50ms (or maybe less because they're already paying the overhead of preparing the open ledger). Why not make the default behavior to wait? Or is that something planned for the future once we have data about how well it works in the real world?

And while I'm talking about ideas for the future, it might be worthwhile to make the default configurable, and even to disallow certain modes for non-admin connections. I'm thinking about public nodes who might want to reduce their load by not letting anybody submit with sync.

@mtrippled
Collaborator Author
mtrippled commented Apr 25, 2023

Each batch takes place 100ms after the preceding. So the average additional wait time should be about 50ms.

Yeah, duh. I should have realized that. But anyway, the reason I ask is that I suspect most clients aren't going to notice an extra 50ms (or maybe less because they're already paying the overhead of preparing the open ledger). Why not make the default behavior to wait? Or is that something planned for the future once we have data about how well it works in the real world?

And while I'm talking about ideas for the future, it might be worthwhile to make the default configurable, and even to disallow certain modes for non-admin connections. I'm thinking about public nodes who might want to reduce their load by not letting anybody submit with sync.

I want to keep "sync" the default for now to not surprise anybody. The necessity to phase this out and to either use "wait" or "async" will happen as volumes increase to well over 2000/s based on the testing I've done. Consider that if a current tx submission takes 10ms from the client's perspective, then adding 50ms to that will decrease throughput from 100/s to about 17/s. That's for somebody doing a lot of transaction submission.
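
(For reference, the arithmetic behind that estimate, assuming a single client submitting transactions strictly one after another:)

$$\frac{1000\ \text{ms/s}}{10\ \text{ms} + 50\ \text{ms}} \approx 16.7\ \text{transactions per second}$$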

I think the idea of allowing administrators to disable certain modes is a good one--I can see somebody like XRPLF wanting to do that, as well as Ripple. But in practice it would also mean updating clients to be wise to this change. Maybe refinements such as this can be something the broad community can debate also?

@@ -562,6 +562,7 @@ JSS(sub_index); // in: LedgerEntry
JSS(subcommand); // in: PathFind
JSS(success); // rpc
JSS(supported); // out: AmendmentTableImpl
JSS(sync); // in: Submit
Contributor

If I am correct that this becomes part of our client-facing API, we should make sure this is the name we want. I find myself wondering if another name, such as submit or maybe submit_mode would be better. I'm not the right person to answer this question. I'm simply trying to make this decision more visible. If sync is the best name, I'm fine with that.

Collaborator Author

@intelliot what's the best way to define this new API field? I like it as is, since the behavior has to do with being synchronous. But I also don't care enough either way. How should we handle this?

Collaborator

  1. Can you (re)explain the purpose and meaning of the field?
  2. Why does it need to be user-facing?
  3. How should users decide what value to set in the field?

Once we have simplified and accurate answers to the above, then we should collect feedback from API users like @justinr1234, @mvadari, @mDuo13 to get a recommendation for how to define and introduce the field.

Collaborator Author
@mtrippled mtrippled May 11, 2023

@mDuo13 can you please review the proposed API change, described at the top of this PR?

Collaborator Author

@ximinez what about "sync_mode"?


Sync mode sounds confusing to me. I prefer mode or submit_mode.

Collaborator

I could see the value in sync_mode. Like I said earlier, we already know that it's related to submit, but what happens if down the road there's some other processing option that needs a separate mode. I have no idea what that would be, but you never know. If we do sync_mode now, then it'll be less confusing to add foo_mode later.

Collaborator Author

"sync_mode" means that the particular mode defines the behavior having to do with synchronicity.
"mode" can be anything and "submit_mode" is redundant. On the other hand, the interface is something that strikes everybody differently. What do you guys think at this point? @justinr1234 @ximinez

Collaborator

sync_mode works for me.

Contributor
@HowardHinnant HowardHinnant left a comment

Left a few comments, but no blockers.

Collaborator
@ximinez ximinez left a comment

I left several suggestions, but I like the changes overall.

@@ -562,6 +562,7 @@ JSS(sub_index); // in: LedgerEntry
JSS(subcommand); // in: PathFind
JSS(success); // rpc
JSS(supported); // out: AmendmentTableImpl
JSS(sync); // in: Submit
Collaborator

I like submit_mode or maybe just mode (since it would be a param to a "submit" command, it would be a "submit mode" either way).

@justinr1234

async: This returns immediately, and is actually faster for the client. However, transaction submission errors that occur only once the transaction is applied will be unknown to the client. This should be used for performance testing.

Is it possible to asynchronously send an error message to the websocket connection that the client program can still handle later?

@mtrippled
Collaborator Author

I created an issue for this so it can be considered in a future project: #4587

@intelliot intelliot merged commit 002893f into XRPLF:develop Sep 11, 2023
@mDuo13
Collaborator
mDuo13 commented Sep 15, 2023

Proposed release notes blurb:

  • Changed transaction processing so that the initial processing happens in batches every 100ms, which allows for much higher transaction throughput. However, this adds latency of up to 100ms (50ms on average) to processing of transactions from the peer-to-peer network. If you are connected to multiple servers and sending transactions less than 100ms apart, be sure to track your sequence numbers manually; otherwise, autofilling may use the same sequence number for different transactions, resulting in one of those transactions never being confirmed by consensus. (When this happens, you may get the result code tefPAST_SEQ from submitting transactions, but even transactions with an initial result of tesSUCCESS may fail to reach a consensus due to a conflict of sequence numbers.)
  • Added a new sync_mode parameter to the submit command to control batch processing of submitted transactions. The default value, sync, is the same as current behavior, to process each transaction individually. Other values allow transactions to be processed in batches every 100ms. Use async to return immediately with a terSUBMITTED result code, or wait to return only after the transaction batch has been processed.

@mtrippled
Collaborator Author

Proposed release notes blurb:

  • Changed transaction processing so that the initial processing happens in batches every 100ms, which allows for much higher transaction throughput. However, this adds latency of up to 100ms (50ms on average) to processing of transactions from the peer-to-peer network. If you are connected to multiple servers and sending transactions less than 100ms apart, be sure to track your sequence numbers manually; otherwise, autofilling may use the same sequence number for different transactions, resulting in one of those transactions never being confirmed by consensus. (When this happens, you may get the result code tefPAST_SEQ from submitting transactions, but even transactions with an initial result of tesSUCCESS may fail to reach a consensus due to a conflict of sequence numbers.)

Good. Perhaps reiterate that this is only a potential problem if people use multiple rippled servers for the same sending account and rapidly send transactions. Up to you. Otherwise, thanks @mDuo13

  • Added a new sync_mode parameter to the submit command to control batch processing of submitted transactions. The default value, sync, is the same as current behavior, to process each transaction individually. Other values allow transactions to be processed in batches every 100ms. Use async to return immediately with a terSUBMITTED result code, or wait to return only after the transaction batch has been processed.

ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
Add new transaction submission API field, "sync", which
determines behavior of the server while submitting transactions:
1) sync (default): Process transactions in a batch immediately,
   and return only once the transaction has been processed.
2) async: Put transaction into the batch for the next processing
   interval and return immediately.
3) wait: Put transaction into the batch for the next processing
   interval and return only after it is processed.
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 22, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
ckeshava pushed a commit to ckeshava/rippled that referenced this pull request Sep 25, 2023
ximinez added a commit to ximinez/rippled that referenced this pull request Dec 14, 2023
ximinez added a commit to ximinez/rippled that referenced this pull request Dec 14, 2023
scottschurr added a commit to scottschurr/rippled that referenced this pull request Dec 18, 2023
This reverts commit 002893f.

There were two files with conflicts in the automated revert:
- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp
Those files were manually resolved.

There is currently no evidence that any problems were introduced
by XRPLF#4504.  However something is misbehaving on the current state
of develop, and pull request XRPLF#4504 was identified as a possible
suspect.
intelliot pushed a commit that referenced this pull request Dec 20, 2023
This reverts commit 002893f.

There were two files with conflicts in the automated revert:

- src/ripple/rpc/impl/RPCHelpers.h and
- src/test/rpc/JSONRPC_test.cpp

Those files were manually resolved.
@intelliot intelliot mentioned this pull request Dec 20, 2023
@intelliot intelliot added the Reverted label Jan 10, 2024
@intelliot intelliot modified the milestones: 10000 TPS 2023-09, TPS Jan 10, 2024
@intelliot
Collaborator

Open question: How would this perform if the transaction rebroadcast interval is reduced?

  • Retransmit more often, so txs are stuck on one server for less time?

@intelliot intelliot added the Perf Attn Needed label and removed the Performance/Resource Improvement label Jan 11, 2024
@intelliot
Collaborator
  • Requires additional testing (and possibly debugging)

legleux pushed a commit to legleux/rippled that referenced this pull request Jan 27, 2024
legleux pushed a commit to legleux/rippled that referenced this pull request Jan 27, 2024
@ximinez ximinez mentioned this pull request Apr 10, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
ximinez added a commit that referenced this pull request May 1, 2025
Combines four related changes:
1. "Decrease `shouldRelay` limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the `HashRouter` entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
2.  "Give a transaction more chances to be retried." Will put a transaction into `LedgerMaster`'s held transactions if the transaction gets a `ter`, `tel`, or `tef` result. Old behavior was just `ter`.
     * Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
3. "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's #4504, which was merged, but unfortunately reverted later by #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
4. "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in `NetworkOPs::apply` which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.
bthomee added a commit that referenced this pull request May 18, 2025
* refactor: Remove unused and add missing includes (#5293)

The codebase is filled with includes that are unused, and which thus can be removed. At the same time, the files often do not include all headers that contain the definitions used in those files. This change uses clang-format and clang-tidy to clean up the includes, with minor manual intervention to ensure the code compiles on all platforms.

* refactor: Calculate numFeatures automatically (#5324)

Requiring manual updates of numFeatures is an annoying manual process that is easily forgotten, and leads to frequent merge conflicts. This change takes advantage of the `XRPL_FEATURE` and `XRPL_FIX` macros, and adds a new `XRPL_RETIRE` macro to automatically set `numFeatures`.

* refactor: Improve ordering of headers with clang-format (#5343)

Removes all manual header groupings from source and header files by leveraging clang-format options.

* Rename "deadlock" to "stall" in `LoadManager` (#5341)

What the LoadManager class does is stall detection, which is not the same as deadlock detection. In the condition of severe CPU starvation, LoadManager will currently intentionally crash rippled reporting `LogicError: Deadlock detected`. This error message is misleading as the condition being detected is not a deadlock. This change fixes and refactors the code in response.

* Adds hub.xrpl-commons.org as a new Bootstrap Cluster (#5263)

* fix: Error message for ledger_entry rpc (#5344)

Changes the error to `malformedAddress` for `permissioned_domain` in the `ledger_entry` rpc, when the account is not a string. This change makes it more clear to a user what is wrong with their request.

* fix: Handle invalid marker parameter in grpc call (#5317)

The `end_marker` is used to limit the range of ledger entries to fetch. If `end_marker` is less than `marker`, a crash can occur. This change adds an additional check.

* fix: trust line RPC no ripple flag (#5345)

The Trustline RPC `no_ripple` flag gets set depending on `lsfDefaultRipple` flag, which is not a flag of a trustline but of the account root. The `lsfDefaultRipple` flag does not provide any insight if this particular trust line has `lsfLowNoRipple` or `lsfHighNoRipple` flag set, so it should not be used here at all. This change simplifies the logic.

* refactor: Updates Conan dependencies: RocksDB (#5335)

Updates RocksDB to version 9.7.3, the latest version supported in Conan 1.x. A patch for 9.7.4 that fixes a memory leak is included.

* fix: Remove null pointer deref, just do abort (#5338)

This change removes the existing undefined behavior from `LogicError`, so we can be certain that there will be always a stacktrace.

De-referencing a null pointer is an old trick to generate `SIGSEGV`, which would typically also create a stacktrace. However it is also an undefined behaviour and compilers can do something else. A more robust way to create a stacktrace while crashing the program is to use `std::abort`, which we have also used in this location for a long time. If we combine the two, we might not get the expected behaviour - namely, the nullpointer deref followed by `std::abort`, as handled in certain compiler versions may not immediately cause a crash. We have observed stacktrace being wiped instead, and thread put in indeterminate state, then stacktrace created without any useful information.

* chore: Add PR number to payload (#5310)

This PR adds one more payload field to the libXRPL compatibility check workflow - the PR number itself.

* chore: Update link to ripple-binary-codec (#5355)

The link to ripple-binary-codec's definitions.json appears to be outdated. The updated link is also documented here: https://xrpl.org/docs/references/protocol/binary-format#definitions-file

* Prevent consensus from getting stuck in the establish phase (#5277)

- Detects if the consensus process is "stalled". If it is, then we can declare a 
  consensus and end successfully even if we do not have 80% agreement on
  our proposal.
  - "Stalled" is defined as:
    - We have a close time consensus
    - Each disputed transaction is individually stalled:
      - It has been in the final "stuck" 95% requirement for at least 2
        (avMIN_ROUNDS) "inner rounds" of phaseEstablish,
      - and either all of the other trusted proposers or this validator, if proposing,
        have had the same vote(s) for at least 4 (avSTALLED_ROUNDS) "inner
        rounds", and at least 80% of the validators (including this one, if
        appropriate) agree about the vote (whether yes or no).
- If we have been in the establish phase for more than 10x the previous
  consensus establish phase's time, then consensus is considered "expired",
  and we will leave the round, which sends a partial validation (indicating
  that the node is moving on without validating). Two restrictions avoid
  prematurely exiting, or having an extended exit in extreme situations.
  - The 10x time is clamped to be within a range of 15s
    (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS).
  - If consensus has not had an opportunity to walk through all avalanche
    states (defined as not going through 8 "inner rounds" of phaseEstablish),
    then ConsensusState::Expired is treated as ConsensusState::No.
- When enough nodes leave the round, any remaining nodes will see they've
  fallen behind, and move on, too, generally before hitting the timeout. Any
  validations or partial validations sent during this time will help the
  consensus process bring the nodes back together.

* test: enable TxQ unit tests work with variable reference fee (#5118)

In preparation for a potential reference fee change we would like to verify that fee change works as expected. The first step is to fix all unit tests to be able to work with different reference fee values.

* test: enable unit tests to work with variable reference fee (#5145)

Fix remaining unit tests to be able to process reference fee values other than 10.

* Intrusive SHAMap smart pointers for efficient memory use and lock-free synchronization (#5152)

The main goal of this optimisation is memory reduction in SHAMapTreeNodes by introducing intrusive pointers instead of standard std::shared_ptr and std::weak_ptr.

* refactor: Move integration tests from 'examples/' into 'tests/' (#5367)

This change moves `examples/example` into `tests/conan` to make it clear it is an integration test, and adjusts the `conan` CI job accordingly

* test: enable compile time param to change reference fee value (#5159)

Adds an extra CI pipeline to perform unit tests using different values for fees.

* Fix undefined uint128_t type on Windows non-unity builds (#5377)

As part of import optimization, a transitive include had been removed that defined `BOOST_COMP_MSVC` on Windows. In unity builds, this definition was pulled in, but in non-unity builds it was not - causing a compilation error. An inspection of the Boost code revealed that we can just gate the statements by `_MS_VER` instead. A `#pragma message` is added to verify that the statement is only printed on Windows builds.

* fix: uint128 ambiguousness breaking macos unity build (#5386)

* Fix to correct memory ordering for compare_exchange_weak and wait in the intrusive reference counting logic (#5381)

This change addresses a memory ordering assertion failure observed on one of the Windows test machines during the IntrusiveShared_test suite.

* fix: disable `channel_authorize` when `signing_support` is disabled (#5385)

* fix: Use the build image from ghcr.io (#5390)

The ci pipelines are constantly hitting Docker Hub's public rate limiting since increasing the number of jobs we're running. This change switches over to images hosted in GitHub's registry.

* Remove UNREACHABLE from `NetworkOPsImp::processTrustedProposal` (#5387)

It’s possible for this to happen legitimately if a set of peers, including a validator, are connected in a cycle, and the latency and message processing time between those peers is significantly less than the latency between the validator and the last peer. It’s unlikely in the real world, but obviously easy to simulate with Antithesis.

* Instrument proposal, validation and transaction messages (#5348)

Adds metric counters for the following P2P message types:

* Untrusted proposal and validation messages
* Duplicate proposal, validation and transaction messages

* refactor(trivial): reorganize ledger entry tests and helper functions (#5376)

This PR splits out `ledger_entry` tests into its own file (`LedgerEntry_test.cpp`) and alphabetizes the helper functions in `LedgerEntry.cpp`. These commits were split out of #5237 to make that PR a little more manageable, since these basic trivial changes are most of the diff. There is no code change, just moving code around.

* fix: `fixPayChanV1` (#4717)

This change introduces a new fix amendment (`fixPayChanV1`) that prevents the creation of new `PaymentChannelCreate` transaction with a `CancelAfter` time less than the current ledger time. It piggy backs off of fix1571.

Once the amendment is activated, creating a new `PaymentChannel` will require that if you specify the `CancelAfter` time/value, that value must be greater than or equal to the current ledger time.

Currently users can create a payment channel where the `CancelAfter` time is before the current ledger time. This results in the payment channel being immediately closed on the next PaymentChannel transaction.

* Fix: admin RPC webhook queue limit removal and timeout reduction (#5163)

When using subscribe at admin RPC port to send webhooks for the transaction stream to a backend, on large(r) ledgers the endpoint receives fewer HTTP POSTs with TX information than the amount of transactions in a ledger. This change removes the hardcoded queue length to avoid dropping TX notifications for the admin-only command. In addition, the per-request TTL for outgoing RPC HTTP calls has been reduced from 10 minutes to 30 seconds.

* fix: Adds CTID to RPC tx and updates error (#4738)

This change fixes a number of issues involved with CTID:
* CTID is not present on all RPC tx transactions.
* rpcWRONG_NETWORK is missing in the ErrorCodes.cpp

* Temporary disable automatic triggering macOS pipeline (#5397)

We temporarily disable running unit tests on macOS on the CI pipeline while we are investigating the delays.

* refactor: Clean up test logging to make it easier to search (#5396)

This PR replaces the word `failed` with `failure` in any test names and renames some test files to fix MSVC warnings, so that it is easier to search through the test output to find tests that failed.

* chore: Run CI on PRs that are Ready or have the "DraftRunCI" label (#5400)

- Avoids costly overhead for idle PRs where the CI results don't add any
  value.

* fix: CTID to use correct ledger_index (#5408)

* chore: Small clarification to lsfDefaultRipple comment (#5410)

* fix: Replaces random endpoint resolution with sequential (#5365)

This change addresses an issue where `rippled` attempts to connect to an IPv6 address, even when the local network lacks IPv6 support, resulting in a "Network is unreachable" error.

The fix replaces the custom endpoint selection logic with `boost::async_connect`, which sequentially attempts to connect to available endpoints until one succeeds or all fail.

* Improve transaction relay logic (#4985)

Combines four related changes:
1. "Decrease `shouldRelay` limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the `HashRouter` entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
2.  "Give a transaction more chances to be retried." Will put a transaction into `LedgerMaster`'s held transactions if the transaction gets a `ter`, `tel`, or `tef` result. Old behavior was just `ter`.
     * Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
3. "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's #4504, which was merged, but unfortunately reverted later by #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
4. "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in `NetworkOPs::apply` which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.

* Enable passive squelching (#5358)

This change updates the squelching logic to accept squelch messages for untrusted validators. As a result, servers will also squelch untrusted validator messages reducing duplicate traffic they generate.

In particular:
* Updates squelch message handling logic to squelch messages for all validators, not only trusted ones.
* Updates the logic to send squelch messages to peers that don't squelch themselves
* Increases the threshold for the number of messages that a peer has to deliver to consider it as a candidate for validator messages.

* Add PermissionDelegation feature (#5354)

This change implements the account permission delegation described in XLS-75d, see XRPLF/XRPL-Standards#257.

* Introduces transaction-level and granular permissions that can be delegated to other accounts.
* Adds `DelegateSet` transaction to grant specified permissions to another account.
* Adds `ltDelegate` ledger object to maintain the permission list for delegating/delegated account pair.
* Adds an optional `Delegate` field in common fields, allowing a delegated account to send transactions on behalf of the delegating account within the granted permission scope. The `Account` field remains the delegating account; the `Delegate` field specifies the delegated account. The transaction is signed by the delegated account.

* refactor: use east const convention (#5409)

This change refactors the codebase to use the "east const convention", and adds a clang-format rule to follow this convention.

* fix: enable LedgerStateFix for delegation (#5427)

* Configure CODEOWNERS for changes to RPC code (#5266)

To ensure changes to any RPC-related code are compatible with other services, such as Clio, the RPC team will be required to review them.

* fix: Ensure that coverage file generation is atomic. (#5426)

Running unit tests in parallel and multiple threads can write into one file can corrupt output files, and then gcovr won't be able to parse the corrupted file. This change adds -fprofile-update=atomic as instructed by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68080.

* fix: Update validators-example.txt fix xrplf example URL (#5384)

* Fix: Resolve slow test on macOS pipeline (#5392)

Using std::barrier performs extremely poorly (~1 hour vs ~1 minute to run the test suite) in certain macOS environments.
To unblock our macOS CI pipeline, std::barrier has been replaced with a custom mutex-based barrier (Barrier) that significantly improves performance without compromising correctness.

* Set version to 2.5.0-b1

---------

Co-authored-by: Bart <bthomee@users.noreply.github.com>
Co-authored-by: Ed Hennis <ed@ripple.com>
Co-authored-by: Bronek Kozicki <brok@incorrekt.com>
Co-authored-by: Darius Tumas <Tokeiito@users.noreply.github.com>
Co-authored-by: Sergey Kuznetsov <skuznetsov@ripple.com>
Co-authored-by: cyan317 <120398799+cindyyan317@users.noreply.github.com>
Co-authored-by: Vlad <129996061+vvysokikh1@users.noreply.github.com>
Co-authored-by: Alex Kremer <akremer@ripple.com>
Co-authored-by: Valentin Balaschenko <13349202+vlntb@users.noreply.github.com>
Co-authored-by: Mayukha Vadari <mvadari@ripple.com>
Co-authored-by: Vito Tumas <5780819+Tapanito@users.noreply.github.com>
Co-authored-by: Denis Angell <dangell@transia.co>
Co-authored-by: Wietse Wind <w.wind@ipublications.net>
Co-authored-by: yinyiqian1 <yqian@ripple.com>
Co-authored-by: Jingchen <a1q123456@users.noreply.github.com>
Co-authored-by: brettmollin <brettmollin@users.noreply.github.com>
Labels

  • API Change
  • Clio Reviewed
  • Perf Attn Needed (Attention needed from RippleX Performance Team)
  • Ready to merge (*PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.)
  • Reverted (Changes which should still be considered for re-merging. See "Closed" PRs with this label)
  • Will Need Documentation