feat: only set nodes as active after state sync by jakmeier · Pull Request #319 · sig-net/mpc · GitHub

feat: only set nodes as active after state sync #319


Merged

merged 7 commits into sig-net:develop from jakmeier:check_sync_before_active on May 22, 2025

Conversation

jakmeier
Contributor

This introduces NodeState::Syncing per connection on the mesh level.

The SyncTask will take peers in this state and prioritize syncing them. Once a peer has been synced, that is reported back to the mesh over a message channel.

This PR tries to introduce the new state without changing too much of the architecture and code. I'm not yet sure how exactly the code would best be refactored; for now, this seems good enough to me, but I am very much open to ideas.
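For orientation, here is a minimal sketch of the shape this takes. The type and field names below (NodeStatus, synced_rx, apply_sync_reports) are illustrative only and do not necessarily match the real mpc code:

use std::collections::HashMap;
use tokio::sync::mpsc;

// Hypothetical, simplified types -- only meant to illustrate the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Participant(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeStatus {
    Inactive,
    // New in this PR: the connection is up, but state sync has not finished yet.
    Syncing,
    Active,
}

struct Mesh {
    status: HashMap<Participant, NodeStatus>,
    // The sync task reports each completed sync back over this channel.
    synced_rx: mpsc::Receiver<Participant>,
}

impl Mesh {
    /// Drain sync-completion reports and promote those peers to Active.
    fn apply_sync_reports(&mut self) {
        while let Ok(peer) = self.synced_rx.try_recv() {
            if let Some(status) = self.status.get_mut(&peer) {
                if *status == NodeStatus::Syncing {
                    *status = NodeStatus::Active;
                }
            }
        }
    }
}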

@jakmeier requested a review from ChaoticTempest on April 10, 2025 19:30
Comment on lines 108 to 109
NodeStatus::Syncing
// TODO: (Find out in review) Is my own ID also part of this list?
Contributor Author

I saw in other places that the participants can include our own ID. Do we need to worry about this here? We wouldn't want to trigger a sync with ourselves... @ChaoticTempest

Contributor

yep, the mesh connections contain us, so we need to add a check against participant == me here

Contributor Author

Interesting, so in the pool we also maintain a NodeConnection to self? Shouldn't we filter that when we spawn the tasks?

Contributor

Right now, Mesh doesn't have a notion of me: Participant, so it can't filter out that connection. We could try adding it via the NodeStateWatcher, but that likely won't behave well with the initial KeyGen, so I avoided adding it for now. Sync at least has the me info, so we can filter it there.

Contributor Author

Okay, I think I have this covered now with a check in the sync task, right inside the spawned task of broadcast_sync. It's a bit ugly, but in turn we keep things simple in this PR, I guess.
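Roughly, the added check amounts to the following. This is a hedged sketch only; the real broadcast_sync may look quite different, and the Participant type here is a stand-in:

// Hypothetical sketch of the self-check described above; not the actual code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

fn broadcast_sync(me: Participant, peers: Vec<Participant>) {
    for peer in peers {
        tokio::spawn(async move {
            // The connection pool can contain our own ID, and we never want to
            // trigger a sync with ourselves, so bail out early in that case.
            if peer == me {
                return;
            }
            // ... send the sync request to `peer` and await its response ...
        });
    }
}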

jakmeier added a commit to jakmeier/mpc that referenced this pull request Apr 10, 2025
@jakmeier requested a review from ChaoticTempest on April 10, 2025 20:23
ChaoticTempest previously approved these changes Apr 10, 2025

Contributor

@ChaoticTempest left a comment

So far, I like this approach where we're only going to generate protocols if it's stable via syncing. However, we'll probably need to extend this later: only signature generation relies on the stable set, while Triple and Presignature generation rely on the active set right now.

@jakmeier
Contributor Author

So far, I like this approach where we're only going to generate protocols if it's stable via syncing. However, we'll probably need to extend this later: only signature generation relies on the stable set, while Triple and Presignature generation rely on the active set right now.

Yes, good point. I noticed that too when I wrote this PR. I think it should be easy to add them after this is merged.

But anyway, as long as they don't respond with a list of what they still have of our owned presignatures / triples, this isn't actually providing much value. The main thing we can prevent by waiting for sync to finish is weird race conditions where sync removes data at just the wrong moment.

@jakmeier
Contributor Author

I see this warning many times while running the integration tests:

WARN mpc_node::protocol::message: some participants are not active even though mesh says they are not_active={Participant(0), Participant(1)}
WARN mpc_node::protocol::message: some participants are not active even though mesh says they are not_active={Participant(1), Participant(2)}
...

from here:

if !not_active.is_empty() {
    tracing::warn!(
        ?not_active,
        "some participants are not active even though mesh says they are"
    );
}

Should I worry about that? @ChaoticTempest

@ChaoticTempest
Contributor

shouldn't be too big of a concern. Those messages should only pop up during keygen. It's because some nodes haven't started their keygen yet, while one node has fully reached the point where it has started. The message is a bit of a misnomer because keygen uses all participants regardless of whether they're active or not.

ChaoticTempest previously approved these changes Apr 10, 2025
@jakmeier
Contributor Author

Hm, but there are some test failures. Not sure if I'll find time before next week to look into them.

2025-04-10T21:38:26.8552080Z failures:
2025-04-10T21:38:26.8552185Z     cases::test_batch_duplicate_signature
2025-04-10T21:38:26.8552268Z     cases::test_batch_random_signature
2025-04-10T21:38:26.8552346Z     cases::test_key_derivation
2025-04-10T21:38:26.8552428Z     cases::test_lake_congestion
2025-04-10T21:38:26.8552508Z     cases::test_multichain_reshare
2025-04-10T21:38:26.8552622Z     cases::test_multichain_reshare_with_lake_congestion
2025-04-10T21:38:26.8552712Z     cases::test_multichain_update_contract
2025-04-10T21:38:26.8552792Z     cases::test_signature_basic
2025-04-10T21:38:26.8552867Z     cases::test_signature_many
2025-04-10T21:38:26.8552947Z     cases::test_signature_offline_node
2025-04-10T21:38:26.8553056Z     cases::test_signature_offline_node_back_online
2025-04-10T21:38:26.8553130Z     cases::test_signature_rogue
2025-04-10T21:38:26.8553134Z 
2025-04-10T21:38:26.8553389Z test result: FAILED. 3 passed; 12 failed; 1 ignored; 0

@ChaoticTempest
Contributor

I'll take a look and try to fix it

@ChaoticTempest
Contributor

ahh, it's because of keygen again. Now we have Inactive, which does not add participants to the active set, while the Generating state needs them

@jakmeier
Copy link
Contributor Author

Thanks @ChaoticTempest for finding this! I changed it back to the old behaviour but one test still fails

2025-04-14T15:34:06.8275951Z failures:
2025-04-14T15:34:06.8276148Z     cases::test_multichain_reshare_with_lake_congestion
2025-04-14T15:34:06.8276299Z 
2025-04-14T15:34:06.8276618Z test result: FAILED. 14 passed; 1 failed; 1 ignored; 0

@ChaoticTempest
Contributor

that one is a bit flaky sometimes. Let me rerun the CI to see

@ChaoticTempest
Contributor

looks like this change affects resharing when there's latency. Still looking into why, but going from resharing to running makes the test fail because we don't have enough stable participants to generate a signature

@ChaoticTempest
Contributor

I think the issue with this is that we now have a race condition where some nodes observe a different stable set than other nodes, meaning some nodes are synced faster than others. And then, when they each have their own version of stable, they incorrectly start up a signature generation protocol and time out. This becomes more apparent when there's latency.

So I don't think we can add this in until we have consensus with Init messages then

@volovyks
Contributor

@ChaoticTempest that is what I was talking about at our last meeting. Only the proposer should decide who is stable and can participate in the protocol. All other protocol participants should blindly participate (or accept the protocol start message) without checking whether the proposer is a stable connection.

@ChaoticTempest
Contributor

Talked with @volovyks offline to clarify: what he was proposing is that we deterministically choose the proposer from the list of all participants instead of just the stable ones, and have a queue of proposers in case a signature generation fails. @jakmeier I remember you saying it's better to choose only the online ones.

The issue with the queue of proposers is that it creates a way for a participant to eventually become the proposer and to know ahead of time at which retry count they will be the proposer. This might open up an attack where a rogue participant can hijack the process and be the one to publish a signature, so I don't think the queue works well here.

Also, some more context on the issue I observed: after resharing, we immediately try to do generation on the next list of stable participants, but since sync happens in the background, the stable list will be incorrect until we have the full list. SignQueue also doesn't reorganize failed requests, so once we index a request and organize the list of participants for it, that list stays static. So we can either try to fix SignQueue to retry with a different list of participants each time, add a separate node state after resharing for synchronizing so that we don't start SignatureManager until we're in Running, or come up with a better consensus for the proposer/request-participant-list.
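To make the SignQueue point concrete, a hedged sketch of the first option (all names are hypothetical and only illustrate the idea of recomputing participants per retry; SignQueue itself looks different):

// Hypothetical sketch only; the point is to recompute the participant list from
// the current stable set on every retry instead of storing it once at index time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

struct SignRequest {
    id: [u8; 32],
    // Today (conceptually): fixed when the request is first organized and then
    // reused unchanged for every retry. Proposed fix: derive it per attempt.
    participants: Vec<Participant>,
}

/// Pick the participants for one attempt from whatever is stable right now.
fn participants_for_attempt(stable: &[Participant], threshold: usize) -> Option<Vec<Participant>> {
    if stable.len() < threshold {
        return None; // not enough stable nodes; retry later
    }
    Some(stable[..threshold].to_vec())
}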

@jakmeier
Contributor Author

Thanks @ChaoticTempest !

I think the issue with this is that we now have a race condition where some nodes observe a different stable set than other nodes, meaning some nodes are synced faster than others. And then, when they each have their own version of stable, they incorrectly start up a signature generation protocol and time out. This becomes more apparent when there's latency.

I believe this race condition is not new; it just now gets caught by the tests due to the additional latency.
That's also in line with this test already being flaky before my changes.

Overall, we must be able to deal with cases where a part of the network is unstable and different nodes have a different view of that.

So I don't think we can add this in until we have consensus with Init messages then

Sounds good to me.
We might also find more missing checks that should be done before starting a protocol.
This PR can wait on the sidelines until everything is in order.

Only the proposer should decide who is stable and can participate in the protocol. All other protocol participants should blindly participate (or accept the protocol start message) without checking whether the proposer is a stable connection.

@volovyks Yes, agreed. Whether the proposer is stable or not shouldn't change our decision to participate. We should primarily only check that the request is valid (indexed and final).
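To pin the idea down, a hedged sketch of that acceptance rule (every name here is made up for illustration; the real message and validation types differ):

// Hypothetical sketch; none of these types map one-to-one onto the real code.
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Participant(u32);

struct SignStart {
    request_id: [u8; 32],
    proposer: Participant,
    // Chosen solely by the proposer; we take it as-is.
    participants: Vec<Participant>,
}

/// Decide whether to join a signature generation started by someone else.
fn should_participate(msg: &SignStart, indexed_final_requests: &HashSet<[u8; 32]>) -> bool {
    // Only check that the request itself is valid (indexed and final).
    // Deliberately no check on whether the proposer is in our stable set.
    indexed_final_requests.contains(&msg.request_id)
}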

The issue with the queue of proposers is that it creates a way for a participant to eventually become the proposer and to know ahead of time at which retry count they will be the proposer. This might open up an attack where a rogue participant can hijack the process and be the one to publish a signature, so I don't think the queue works well here.

@ChaoticTempest I see what you mean. But the protocol is not byzantine fault tolerant at the moment, and I don't think we are ready to tackle that yet. Let's focus on something that's reliable under nice conditions first.
Also, forcing yourself to be the proposer shouldn't be a security issue. Worst case, the MPC network generates the same signature more than once.

But you are right, this might become a problem in the future. We will need a byzantine-tolerant proposer election at some point.

So we can either try to fix SignQueue to retry with a different list of participants each time, add a separate node state after resharing for synchronizing so that we don't start SignatureManager until we're in Running, or come up with a better consensus for the proposer/request-participant-list.

I think we should start with:

  1. Init message to ensure everyone uses the same list (a rough sketch of what such a message could carry follows below)
  2. Change the participant list on any retry (I believe that's what you do in feat: reorganize sign request participants on failure #321)
  3. Also change the proposer on a longer timeout (that could be based on a proposer queue)

Hopefully this makes things reliable enough that we can have different views on the network per node without it breaking down. Otherwise, we will have to search deeper.
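As a very rough illustration of step 1 above (entirely hypothetical; the actual message format and consensus rules would be decided in the follow-up work):

// Entirely hypothetical sketch of an Init-style start message: the proposer
// fixes the participant list up front so every node works with the same one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

#[derive(Debug, Clone)]
struct SignInit {
    request_id: [u8; 32],
    proposer: Participant,
    // The exact participant set for this attempt, chosen by the proposer alone.
    participants: Vec<Participant>,
    // Bumped on every retry, so each attempt can use a fresh participant list
    // (and, after a longer timeout, a different proposer).
    attempt: u32,
}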

@jakmeier
Contributor Author
jakmeier commented Apr 17, 2025

@jakmeier I remember you saying it's better to choose only the online ones.

@ChaoticTempest Yeah but only when the proposer selects the list of participants. Everyone else must trust the list sent by the proposer and not interfere with it in any way.

@jakmeier force-pushed the check_sync_before_active branch from 440b34e to f0055d2 on April 24, 2025 10:39
feat: make only inactive => active be syncable
@jakmeier merged commit 929854b into sig-net:develop on May 22, 2025
1 of 3 checks passed