-
Notifications
You must be signed in to change notification settings - Fork 715
refactor(state-sync): Refer to other nodes by PeerId #9591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* feat(resharding): added metrics * shard_uid * code review * shard_uid
The `lib.rs` file accumulated structs and functionality from different areas. Split the file into several smaller files.
…ration (#9539) * remove force_* features and move runtime selection to runtime configuration There are a couple of reasons, but the most major one is that these `force_*` fetures make it impossible to run tools with `--all-features` enabled on macOS (where use of wasmer0 would be forced somewhat arbitrarily.) That said, these options don't make sense, in the world of limited replayability, and with runtime configuration there is also now a fairly straightforward approach to forcing the use of a different runtime than the "default" (just chnge some yamls.) As an upside this also forced my hand to go and fix the Config::test nonsense, a little bit, hence Fixes #8202 and tests will now actually exercise the system in a configuration that is going to run in production most of the time! Yay! * Replace !0 with PROTOCOL_VERSION * Remove the wasmtime feature from runtime-params-estimator * Fix test expectations
* debug: Add #[perf] to more actors This allows to observe more details about actor utilization when compiling with `--features=performance_stats`. Running some loadtest with this reveals that `ClientActor` and `ShardsManagerActor` are the busiest in some cases, which would not be possible to observe without the instrumentation added in this commit. * cargo fmt
Chunks can be MBs in size. We should not force a read from DB when it's not cached already. Also alters the `chunk_exists` method to check the cache first. An `exists` call on the DB is still more expensive than looking up an `Arc` in the in-memory cache.
* Logging: Deprecate --verbose --verbose is not a bool, but a String argument, which is confusing. * Logging: Deprecate --verbose --verbose is not a bool, but a String argument, which is confusing. * Logging: Deprecate --verbose --verbose is not a bool, but a String argument, which is confusing.
* Logging: New delay detector * Logging: New delay detector * Logging: New delay detector * Logging: New delay detector * Rename to SpanDurationLogger * Use Durations instead of nanoseconds * refactor: Tracing (#9544) The `lib.rs` file accumulated structs and functionality from different areas. Split the file into several smaller files. * Logging: New delay detector * Rename to SpanDurationLogger * Use Durations instead of nanoseconds * Use Durations instead of nanoseconds * refactor: Tracing (#9544) The `lib.rs` file accumulated structs and functionality from different areas. Split the file into several smaller files. * Logging: New delay detector * refactor: Tracing (#9544) The `lib.rs` file accumulated structs and functionality from different areas. Split the file into several smaller files. * Use Durations instead of nanoseconds
The clone was only there because of lifetime complaints that were due to the message being moved to the test observer after processing. We can avoid the clone entirely in non-test setup. This is important because these messages can be 100MB+
Continues #9540 Deletes the `DelayDetector` class and replaces the uses of it with spans. --------- Co-authored-by: Anton Puhach <anton@near.org> Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me> Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
RoutingTableView contains a table Direct Peers with one entry per peer. The table displays the latest distance vector shared by each peer. This PR: - Adds the Min Nonce for each distance vector to the display, indicating the age of the received information. - Moves any peers in a disconnected state to a separate table for clarity. <img width="1448" alt="image" src="https://github.com/near/nearcore/assets/3241341/8488eb07-7adf-4c7c-838e-165ca740d2fa">
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it simplifies the code a lot. Generally makes sense to me.
But it feels like I'm missing some context / don't know this code enough.
It would make sense to wait for an approval by @VanBarbascu before merging.
.cares_about_shard_from_prev_block(&prev_block_hash, account_id, shard_id) | ||
.unwrap_or(false) | ||
{ | ||
// If we are one of the validators (me is not None) - then make sure that we don't try to send request to ourselves. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this still work after the refactor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because the list of peers excludes 'ourselves', and we no longer refer to potential targets by account id.
let peer_id = match target { | ||
AccountOrPeerIdOrHash::AccountId(_) => return NetworkResponses::RouteNotFound, | ||
AccountOrPeerIdOrHash::PeerId(peer_id) => peer_id, | ||
AccountOrPeerIdOrHash::Hash(_) => return NetworkResponses::RouteNotFound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find any places where AccountOrPeerIdOrHash::Hash(_)
have been constructed. Was that a dead enum variant already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems so to me too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! We are no longer using Hash and AccountId to refer to peers in state sync.
Versions 0.5.1 - 0.5.7 were yanked due to a race condition. See [changelog](https://github.com/crossbeam-rs/crossbeam/blob/master/crossbeam-channel/CHANGELOG.md#version-058) and [crates.io](https://crates.io/crates/crossbeam-channel/versions) and [PR with the fix](crossbeam-rs/crossbeam#972).
…manager, runtime in integration tests (#9595) We had a common pattern of creating a bunch of manual stores, epoch managers, and nightshade runtime for the test environment. This PR brings the default functionality to the test environment builder.
The point of this file is to quickly see the current parameters. Showing the nightly changes is not in this spirit IMO.
`cargo-deny advisories` complains about an old libsqlite-sys version. We depend on it through rusqlite, which is only used in the `estimator-warehouse`, a CI /&debug script. So it's not super important but still worth updating.
After trying out a couple of things to try to get the unit tests working with snapshotting for our resharding module, I feel like the complexity of resharding functions has reached a place where it might be fine to just remove the unit test module in favor of the integration tests. We have more flexibility with the integration tests and for now, the unit test and the integration tests are basically doing the same thing, The effort to make the unit test work with snapshot is probably not worth it.
… a precise block (#9598) With the current implementation for flat storage iterator, we have the issue that we can not control the block over which we create the iterator. A solution to that is to use the snapshot mechanism from state sync. We can assume that the snapshot would be created on all nodes as of the last block of an epoch. The way we get the correct state for resharding is that we take the snapshot as of the `prev_prev_hash` and append the delta from the `prev_hash` so that the state of the child tries/shards matches that of the last block of prev epoch.
…9630) Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.13 to 1.26.17. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/urllib3/urllib3/releases">urllib3's releases</a>.</em></p> <blockquote> <h2>1.26.17</h2> <ul> <li>Added the <code>Cookie</code> header to the list of headers to strip from requests when redirecting to a different host. As before, different headers can be set via <code>Retry.remove_headers_on_redirect</code>. (GHSA-v845-jxx5-vc9f)</li> </ul> <h2>1.26.16</h2> <ul> <li>Fixed thread-safety issue where accessing a <code>PoolManager</code> with many distinct origins would cause connection pools to be closed while requests are in progress (<a href="https://redirect.github.com/urllib3/urllib3/issues/2954">#2954</a>)</li> </ul> <h2>1.26.15</h2> <ul> <li>Fix socket timeout value when HTTPConnection is reused (<a href="https://redirect.github.com/urllib3/urllib3/issues/2645">urllib3/urllib3#2645</a>)</li> <li>Remove "!" character from the unreserved characters in IPv6 Zone ID parsing (<a href="https://redirect.github.com/urllib3/urllib3/issues/2899">urllib3/urllib3#2899</a>)</li> <li>Fix IDNA handling of 'x80' byte (<a href="https://redirect.github.com/urllib3/urllib3/issues/2901">urllib3/urllib3#2901</a>)</li> </ul> <h2>1.26.14</h2> <ul> <li>Fixed parsing of port 0 (zero) returning None, instead of 0 (<a href="https://redirect.github.com/urllib3/urllib3/issues/2850">#2850</a>)</li> <li>Removed deprecated <code>HTTPResponse.getheaders()</code> calls in <code>urllib3.contrib</code> module.</li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/urllib3/urllib3/blob/main/CHANGES.rst">urllib3's changelog</a>.</em></p> <blockquote> <h1>1.26.17 (2023-10-02)</h1> <ul> <li>Added the <code>Cookie</code> header to the list of headers to strip from requests when redirecting to a different host. As before, different headers can be set via <code>Retry.remove_headers_on_redirect</code>. (<code>[#3139](urllib3/urllib3#3139) <https://github.com/urllib3/urllib3/pull/3139></code>_)</li> </ul> <h1>1.26.16 (2023-05-23)</h1> <ul> <li>Fixed thread-safety issue where accessing a <code>PoolManager</code> with many distinct origins would cause connection pools to be closed while requests are in progress (<code>[#2954](urllib3/urllib3#2954) <https://github.com/urllib3/urllib3/pull/2954></code>_)</li> </ul> <h1>1.26.15 (2023-03-10)</h1> <ul> <li>Fix socket timeout value when <code>HTTPConnection</code> is reused (<code>[#2645](urllib3/urllib3#2645) <https://github.com/urllib3/urllib3/issues/2645></code>__)</li> <li>Remove "!" character from the unreserved characters in IPv6 Zone ID parsing (<code>[#2899](urllib3/urllib3#2899) <https://github.com/urllib3/urllib3/issues/2899></code>__)</li> <li>Fix IDNA handling of '\x80' byte (<code>[#2901](urllib3/urllib3#2901) <https://github.com/urllib3/urllib3/issues/2901></code>__)</li> </ul> <h1>1.26.14 (2023-01-11)</h1> <ul> <li>Fixed parsing of port 0 (zero) returning None, instead of 0. (<code>[#2850](urllib3/urllib3#2850) <https://github.com/urllib3/urllib3/issues/2850></code>__)</li> <li>Removed deprecated getheaders() calls in contrib module. Fixed the type hint of <code>PoolKey.key_retries</code> by adding <code>bool</code> to the union. (<code>[#2865](urllib3/urllib3#2865) <https://github.com/urllib3/urllib3/issues/2865></code>__)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/urllib3/urllib3/commit/c9016bf464751a02b7e46f8b86504f47d4238784"><code>c9016bf</code></a> Release 1.26.17</li> <li><a href="https://github.com/urllib3/urllib3/commit/01220354d389cd05474713f8c982d05c9b17aafb"><code>0122035</code></a> Backport GHSA-v845-jxx5-vc9f (<a href="https://redirect.github.com/urllib3/urllib3/issues/3139">#3139</a>)</li> <li><a href="https://github.com/urllib3/urllib3/commit/e63989f97d206e839ab9170c8a76e3e097cc60e8"><code>e63989f</code></a> Fix installing <code>brotli</code> extra on Python 2.7</li> <li><a href="https://github.com/urllib3/urllib3/commit/2e7a24d08713a0131f0b3c7197889466d645cc49"><code>2e7a24d</code></a> [1.26] Configure OS for RTD to fix building docs</li> <li><a href="https://github.com/urllib3/urllib3/commit/57181d6ea910ac7cb2ff83345d9e5e0eb816a0d0"><code>57181d6</code></a> [1.26] Improve error message when calling urllib3.request() (<a href="https://redirect.github.com/urllib3/urllib3/issues/3058">#3058</a>)</li> <li><a href="https://github.com/urllib3/urllib3/commit/3c0148048a523325819377b23fc67f8d46afc3aa"><code>3c01480</code></a> [1.26] Run coverage even with failed jobs</li> <li><a href="https://github.com/urllib3/urllib3/commit/d94029b7e2193ff47b627906a70e06377a09aae8"><code>d94029b</code></a> Release 1.26.16</li> <li><a href="https://github.com/urllib3/urllib3/commit/18e92145e9cddbabdf51c98f54202aa37fd5d4c8"><code>18e9214</code></a> Use trusted publishing for PyPI</li> <li><a href="https://github.com/urllib3/urllib3/commit/d25cf83bbae850a290fe34ed1610ae55c0558b36"><code>d25cf83</code></a> [1.26] Fix invalid test_ssl_failure_midway_through_conn</li> <li><a href="https://github.com/urllib3/urllib3/commit/25cca389496b86ee809c21e5b641aeaa74809263"><code>25cca38</code></a> [1.26] Fix test_ssl_object_attributes</li> <li>Additional commits viewable in <a href="https://github.com/urllib3/urllib3/compare/1.26.13...1.26.17">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/near/nearcore/network/alerts). </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but is it pure refactoring ?
@@ -535,61 +520,33 @@ impl StateSync { | |||
/// Only select candidates that we have no pending request currently ongoing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment still up to date? The part about validators being candidates in particular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, removed the comment about the validators.
epoch_manager.get_epoch_block_producers_ordered(&epoch_hash, &sync_hash)?; | ||
let peers = block_producers | ||
) -> Result<Vec<PeerId>, near_chain::Error> { | ||
let peers = highest_height_peers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You no longer consider the validators as candidates, do you? Are there any drawbacks to this approach? Do we lose the guarantee that at least some peers will for sure have the needed data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I argue that mainnet validators block requests related to state sync and therefore removing them has no effect on state sync, or even makes it more reliable.
Given the low diameter of the network topology graph, such a peer should be available in TIER2. And if not, then that should be handled by the decentralized state sync separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation does not work and the new design will talk to peers regardless of their status in the network.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I argue that mainnet validators block requests related to state sync and therefore removing them has no effect on state sync, or even makes it more reliable.
Thank you, that makes sense.
Given the low diameter of the network topology graph, such a peer should be available in TIER2.
I believe all peers are available in TIER2. I assumed that we're talking about TIER2 direct peers here, is that right?
And if not, then that should be handled by the decentralized state sync separately.
Yep, that makes perfect sense to me. I'm not sure where exactly stateless validation is heading in terms of number of shards but let's design the solution to handle a case where it's quite likely none of the direct peers have the required state. This can happen e.g. when #shards > #direct peers.
AccountOrPeerIdOrHash::AccountId(_) => return NetworkResponses::RouteNotFound, | ||
AccountOrPeerIdOrHash::PeerId(peer_id) => peer_id, | ||
AccountOrPeerIdOrHash::Hash(_) => return NetworkResponses::RouteNotFound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the point of account id and hash if it always returned RouteNotFound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AccountId used to work because @VanBarbascu changed the messages to be Direct instead of Routed.
Hash hasn't worked ever, no idea what it was for.
Technically no, because it involves a change in behavior. But that behavior wasn't working anyway, therefore a refactor is appropriate :) |
Hehe, ok, shady one ;) |
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: wacban <wacban@users.noreply.github.com> Co-authored-by: Anton Puhach <anton@near.org> Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me> Co-authored-by: Jakob Meier <mail@jakobmeier.ch> Co-authored-by: Saketh Are <saketh.are@gmail.com> Co-authored-by: Shreyan Gupta <shreyan@near.org> Co-authored-by: Ekleog-NEAR <96595974+Ekleog-NEAR@users.noreply.github.com> Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com> Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
No description provided.