8000 refactor(state-sync): Refer to other nodes by PeerId by nikurt · Pull Request #9591 · near/nearcore · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

refactor(state-sync): Refer to other nodes by PeerId #9591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 60 commits into from
Oct 3, 2023
Merged

Conversation

nikurt
Copy link
Contributor
@nikurt nikurt commented Sep 26, 2023

No description provided.

Nikolay Kurtov and others added 18 commits September 19, 2023 17:38
* feat(resharding): added metrics

* shard_uid

* code review

* shard_uid
The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.
…ration (#9539)

* remove force_* features and move runtime selection to runtime configuration

There are a couple of reasons, but the most major one is that these
`force_*` fetures make it impossible to run tools with `--all-features`
enabled on macOS (where use of wasmer0 would be forced somewhat
arbitrarily.)

That said, these options don't make sense, in the world of limited
replayability, and with runtime configuration there is also now a fairly
straightforward approach to forcing the use of a different runtime than
the "default" (just chnge some yamls.)

As an upside this also forced my hand to go and fix the Config::test
nonsense, a little bit, hence

Fixes #8202

and tests will now actually exercise the system in a configuration that
is going to run in production most of the time! Yay!

* Replace !0 with PROTOCOL_VERSION

* Remove the wasmtime feature from runtime-params-estimator

* Fix test expectations
* debug: Add #[perf] to more actors

This allows to observe more details about actor utilization when
compiling with `--features=performance_stats`.

Running some loadtest with this reveals that `ClientActor`
and `ShardsManagerActor` are the busiest in some cases, which
would not be possible to observe without the instrumentation added
in this commit.

* cargo fmt
Chunks can be MBs in size. We should not force a read from DB when it's
not cached already.

Also alters the `chunk_exists` method to check the cache first.
An `exists` call on the DB is still more expensive than looking up an
`Arc` in the in-memory cache.
* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.

* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.

* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.
* Logging: New delay detector

* Logging: New delay detector

* Logging: New delay detector

* Logging: New delay detector

* Rename to SpanDurationLogger

* Use Durations instead of nanoseconds

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Logging: New delay detector

* Rename to SpanDurationLogger

* Use Durations instead of nanoseconds

* Use Durations instead of nanoseconds

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Logging: New delay detector

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Use Durations instead of nanoseconds
Chunk header `gas_limit` was renamed to `prev_gas_limit` in
#9500. After discussion with
@jakmeier I've realised while it is determined in the previous chunk, it
actually corresponds to gas limit for the current chunk.
The clone was only there because of lifetime complaints that were due to
the message being moved to the test observer after processing. We can
avoid the clone entirely in non-test setup.

This is important because these messages can be 100MB+
Continues #9540

Deletes the `DelayDetector` class and replaces the uses of it with
spans.

---------

Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
RoutingTableView contains a table Direct Peers with one entry per peer.
The table displays the latest distance vector shared by each peer.

This PR:
- Adds the Min Nonce for each distance vector to the display, indicating
the age of the received information.
- Moves any peers in a disconnected state to a separate table for
clarity.

<img width="1448" alt="image"
src="https://github.com/near/nearcore/assets/3241341/8488eb07-7adf-4c7c-838e-165ca740d2fa">
@nikurt nikurt marked this pull request as ready for review September 26, 2023 11:27
@nikurt nikurt requested a review from a team as a code owner September 26, 2023 11:27
Copy link
Contributor
@jakmeier jakmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it simplifies the code a lot. Generally makes sense to me.
But it feels like I'm missing some context / don't know this code enough.
It would make sense to wait for an approval by @VanBarbascu before merging.

.cares_about_shard_from_prev_block(&prev_block_hash, account_id, shard_id)
.unwrap_or(false)
{
// If we are one of the validators (me is not None) - then make sure that we don't try to send request to ourselves.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still work after the refactor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the list of peers excludes 'ourselves', and we no longer refer to potential targets by account id.

let peer_id = match target {
AccountOrPeerIdOrHash::AccountId(_) => return NetworkResponses::RouteNotFound,
AccountOrPeerIdOrHash::PeerId(peer_id) => peer_id,
AccountOrPeerIdOrHash::Hash(_) => return NetworkResponses::RouteNotFound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any places where AccountOrPeerIdOrHash::Hash(_) have been constructed. Was that a dead enum variant already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems so to me too.

Copy link
Contributor
@VanBarbascu VanBarbascu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! We are no longer using Hash and AccountId to refer to peers in state sync.

@VanBarbascu VanBarbascu requested a review from jakmeier October 2, 2023 13:19
jakmeier and others added 4 commits October 2, 2023 18:45
…manager, runtime in integration tests (#9595)

We had a common pattern of creating a bunch of manual stores, epoch
managers, and nightshade runtime for the test environment. This PR
brings the default functionality to the test environment builder.
The point of this file is to quickly see the current parameters.
Showing the nightly changes is not in this spirit IMO.
`cargo-deny advisories` complains about an old libsqlite-sys version.

We depend on it through rusqlite, which is only used in the
`estimator-warehouse`, a CI /&debug script. So it's not super
important but still worth updating.
@nikurt nikurt requested review from wacban and jakmeier and removed request for jakmeier October 2, 2023 17:26
Nikolay Kurtov and others added 13 commits October 2, 2023 19:31
After trying out a couple of things to try to get the unit tests working
with snapshotting for our resharding module, I feel like the complexity
of resharding functions has reached a place where it might be fine to
just remove the unit test module in favor of the integration tests.

We have more flexibility with the integration tests and for now, the
unit test and the integration tests are basically doing the same thing,
The effort to make the unit test work with snapshot is probably not
worth it.
… a precise block (#9598)

With the current implementation for flat storage iterator, we have the
issue that we can not control the block over which we create the
iterator. A solution to that is to use the snapshot mechanism from state
sync.

We can assume that the snapshot would be created on all nodes as of the
last block of an epoch.

The way we get the correct state for resharding is that we take the
snapshot as of the `prev_prev_hash` and append the delta from the
`prev_hash` so that the state of the child tries/shards matches that of
the last block of prev epoch.
…9630)

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.13 to
1.26.17.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/releases">urllib3's
releases</a>.</em></p>
<blockquote>
<h2>1.26.17</h2>
<ul>
<li>Added the <code>Cookie</code> header to the list of headers to strip
from requests when redirecting to a different host. As before, different
headers can be set via <code>Retry.remove_headers_on_redirect</code>.
(GHSA-v845-jxx5-vc9f)</li>
</ul>
<h2>1.26.16</h2>
<ul>
<li>Fixed thread-safety issue where accessing a <code>PoolManager</code>
with many distinct origins would cause connection pools to be closed
while requests are in progress (<a
href="https://redirect.github.com/urllib3/urllib3/issues/2954">#2954</a>)</li>
</ul>
<h2>1.26.15</h2>
<ul>
<li>Fix socket timeout value when HTTPConnection is reused (<a
href="https://redirect.github.com/urllib3/urllib3/issues/2645">urllib3/urllib3#2645</a>)</li>
<li>Remove &quot;!&quot; character from the unreserved characters in
IPv6 Zone ID parsing (<a
href="https://redirect.github.com/urllib3/urllib3/issues/2899">urllib3/urllib3#2899</a>)</li>
<li>Fix IDNA handling of 'x80' byte (<a
href="https://redirect.github.com/urllib3/urllib3/issues/2901">urllib3/urllib3#2901</a>)</li>
</ul>
<h2>1.26.14</h2>
<ul>
<li>Fixed parsing of port 0 (zero) returning None, instead of 0 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/2850">#2850</a>)</li>
<li>Removed deprecated <code>HTTPResponse.getheaders()</code> calls in
<code>urllib3.contrib</code> module.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/blob/main/CHANGES.rst">urllib3's
changelog</a>.</em></p>
<blockquote>
<h1>1.26.17 (2023-10-02)</h1>
<ul>
<li>Added the <code>Cookie</code> header to the list of headers to strip
from requests when redirecting to a different host. As before, different
headers can be set via <code>Retry.remove_headers_on_redirect</code>.
(<code>[#3139](urllib3/urllib3#3139)
&lt;https://github.com/urllib3/urllib3/pull/3139&gt;</code>_)</li>
</ul>
<h1>1.26.16 (2023-05-23)</h1>
<ul>
<li>Fixed thread-safety issue where accessing a <code>PoolManager</code>
with many distinct origins
would cause connection pools to be closed while requests are in progress
(<code>[#2954](urllib3/urllib3#2954)
&lt;https://github.com/urllib3/urllib3/pull/2954&gt;</code>_)</li>
</ul>
<h1>1.26.15 (2023-03-10)</h1>
<ul>
<li>Fix socket timeout value when <code>HTTPConnection</code> is reused
(<code>[#2645](urllib3/urllib3#2645)
&lt;https://github.com/urllib3/urllib3/issues/2645&gt;</code>__)</li>
<li>Remove &quot;!&quot; character from the unreserved characters in
IPv6 Zone ID parsing
(<code>[#2899](urllib3/urllib3#2899)
&lt;https://github.com/urllib3/urllib3/issues/2899&gt;</code>__)</li>
<li>Fix IDNA handling of '\x80' byte
(<code>[#2901](urllib3/urllib3#2901)
&lt;https://github.com/urllib3/urllib3/issues/2901&gt;</code>__)</li>
</ul>
<h1>1.26.14 (2023-01-11)</h1>
<ul>
<li>Fixed parsing of port 0 (zero) returning None, instead of 0.
(<code>[#2850](urllib3/urllib3#2850)
&lt;https://github.com/urllib3/urllib3/issues/2850&gt;</code>__)</li>
<li>Removed deprecated getheaders() calls in contrib module. Fixed the
type hint of <code>PoolKey.key_retries</code> by adding
<code>bool</code> to the union.
(<code>[#2865](urllib3/urllib3#2865)
&lt;https://github.com/urllib3/urllib3/issues/2865&gt;</code>__)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/urllib3/urllib3/commit/c9016bf464751a02b7e46f8b86504f47d4238784"><code>c9016bf</code></a>
Release 1.26.17</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/01220354d389cd05474713f8c982d05c9b17aafb"><code>0122035</code></a>
Backport GHSA-v845-jxx5-vc9f (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3139">#3139</a>)</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/e63989f97d206e839ab9170c8a76e3e097cc60e8"><code>e63989f</code></a>
Fix installing <code>brotli</code> extra on Python 2.7</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/2e7a24d08713a0131f0b3c7197889466d645cc49"><code>2e7a24d</code></a>
[1.26] Configure OS for RTD to fix building docs</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/57181d6ea910ac7cb2ff83345d9e5e0eb816a0d0"><code>57181d6</code></a>
[1.26] Improve error message when calling urllib3.request() (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3058">#3058</a>)</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/3c0148048a523325819377b23fc67f8d46afc3aa"><code>3c01480</code></a>
[1.26] Run coverage even with failed jobs</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/d94029b7e2193ff47b627906a70e06377a09aae8"><code>d94029b</code></a>
Release 1.26.16</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/18e92145e9cddbabdf51c98f54202aa37fd5d4c8"><code>18e9214</code></a>
Use trusted publishing for PyPI</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/d25cf83bbae850a290fe34ed1610ae55c0558b36"><code>d25cf83</code></a>
[1.26] Fix invalid test_ssl_failure_midway_through_conn</li>
<li><a
href="https://github.com/urllib3/urllib3/commit/25cca389496b86ee809c21e5b641aeaa74809263"><code>25cca38</code></a>
[1.26] Fix test_ssl_object_attributes</li>
<li>Additional commits viewable in <a
href="https://github.com/urllib3/urllib3/compare/1.26.13...1.26.17">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=urllib3&package-manager=pip&previous-version=1.26.13&new-version=1.26.17)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts page](https://github.com/near/nearcore/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Copy link
Contributor
@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but is it pure refactoring ?

@@ -535,61 +520,33 @@ impl StateSync {
/// Only select candidates that we have no pending request currently ongoing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment still up to date? The part about validators being candidates in particular.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, removed the comment about the validators.

epoch_manager.get_epoch_block_producers_ordered(&epoch_hash, &sync_hash)?;
let peers = block_producers
) -> Result<Vec<PeerId>, near_chain::Error> {
let peers = highest_height_peers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You no longer consider the validators as candidates, do you? Are there any drawbacks to this approach? Do we lose the guarantee that at least some peers will for sure have the needed data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I argue that mainnet validators block requests related to state sync and therefore removing them has no effect on state sync, or even makes it more reliable.
Given the low diameter of the network topology graph, such a peer should be available in TIER2. And if not, then that should be handled by the decentralized state sync separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation does not work and the new design will talk to peers regardless of their status in the network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I argue that mainnet validators block requests related to state sync and therefore removing them has no effect on state sync, or even makes it more reliable.

Thank you, that makes sense.

Given the low diameter of the network topology graph, such a peer should be available in TIER2.

I believe all peers are available in TIER2. I assumed that we're talking about TIER2 direct peers here, is that right?

And if not, then that should be handled by the decentralized state sync separately.

Yep, that makes perfect sense to me. I'm not sure where exactly stateless validation is heading in terms of number of shards but let's design the solution to handle a case where it's quite likely none of the direct peers have the required state. This can happen e.g. when #shards > #direct peers.

Comment on lines -769 to -771
AccountOrPeerIdOrHash::AccountId(_) => return NetworkResponses::RouteNotFound,
AccountOrPeerIdOrHash::PeerId(peer_id) => peer_id,
AccountOrPeerIdOrHash::Hash(_) => return NetworkResponses::RouteNotFound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the point of account id and hash if it always returned RouteNotFound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AccountId used to work because @VanBarbascu changed the messages to be Direct instead of Routed.
Hash hasn't worked ever, no idea what it was for.

@nikurt nikurt enabled auto-merge October 3, 2023 10:44
@nikurt
Copy link
Contributor Author
nikurt commented Oct 3, 2023

LGTM but is it pure refactoring ?

Technically no, because it involves a change in behavior. But that behavior wasn't working anyway, therefore a refactor is appropriate :)

@nikurt nikurt added this pull request to the merge queue Oct 3, 2023
Merged via the queue into master with commit 5d4a7f3 Oct 3, 2023
@nikurt nikurt deleted the nikurt-ss-peer branch October 3, 2023 11:24
@wacban
Copy link
Contributor
wacban commented Oct 3, 2023

LGTM but is it pure refactoring ?

Technically no, because it involves a change in behavior. But that behavior wasn't working anyway, therefore a refactor is appropriate :)

Hehe, ok, shady one ;)

nikurt added a commit that referenced this pull request Oct 11, 2023
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Co-authored-by: Saketh Are <saketh.are@gmail.com>
Co-authored-by: Shreyan Gupta <shreyan@near.org>
Co-authored-by: Ekleog-NEAR <96595974+Ekleog-NEAR@users.noreply.github.com>
Co-authored-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants
0