feat: only set nodes as active after state sync by jakmeier · Pull Request #319 · sig-net/mpc · GitHub

feat: only set nodes as active after state sync #319


Merged

merged 7 commits into sig-net:develop from jakmeier:check_sync_before_active on May 22, 2025

Conversation

jakmeier
Contributor

This introduces NodeState::Syncing per connection on the mesh level.

The SyncTask will take peers in this state and prioritize syncing them. Once a peer has been synced, that is reported back to the mesh over a message channel.

This PR tries to introduce the new state without changing too much of the architecture and code. I'm not yet sure how exactly the code would best be refactored; for now, this seems good enough to me, but I am very much open to ideas.
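For orientation, here is a minimal sketch of the shape this takes. The type and field names below (NodeStatus, synced_rx, apply_sync_reports) are illustrative only and do not necessarily match the real mpc code:

use std::collections::HashMap;
use tokio::sync::mpsc;

// Hypothetical, simplified types -- only meant to illustrate the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Participant(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeStatus {
    Inactive,
    // New in this PR: the connection is up, but state sync has not finished yet.
    Syncing,
    Active,
}

struct Mesh {
    status: HashMap<Participant, NodeStatus>,
    // The sync task reports each completed sync back over this channel.
    synced_rx: mpsc::Receiver<Participant>,
}

impl Mesh {
    /// Drain sync-completion reports and promote those peers to Active.
    fn apply_sync_reports(&mut self) {
        while let Ok(peer) = self.synced_rx.try_recv() {
            if let Some(status) = self.status.get_mut(&peer) {
                if *status == NodeStatus::Syncing {
                    *status = NodeStatus::Active;
                }
            }
        }
    }
}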

@jakmeier requested a review from ChaoticTempest on April 10, 2025 19:30
Comment on lines 108 to 109
NodeStatus::Syncing
// TODO: (Find out in review) Is my own ID also part of this list?
Contributor Author

I saw in other places that the participants can include our own ID. Do we need to worry about this here? We wouldn't want to trigger a sync with ourselves... @ChaoticTempest

Contributor

yep, the mesh connections contain us, so we need to add a check against participant == me here

Contributor Author

Interesting, so in the pool we also maintain a NodeConnection to self? Shouldn't we filter that when we spawn the tasks?

Contributor

Right now, Mesh doesn't have a notion of me: Participant, so it can't filter out that connection. We could try adding it via the NodeStateWatcher, but that likely won't behave well with the initial KeyGen, so I avoided adding it for now. Sync at least has the me info, so we can filter it there.

Contributor Author

Okay, I think I have this covered now with a check in the sync task, right inside the spawned task of broadcast_sync. It's a bit ugly, but in turn we keep things simple in this PR, I guess.
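Roughly, the added check amounts to the following. This is a hedged sketch only; the real broadcast_sync may look quite different, and the Participant type here is a stand-in:

// Hypothetical sketch of the self-check described above; not the actual code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

fn broadcast_sync(me: Participant, peers: Vec<Participant>) {
    for peer in peers {
        tokio::spawn(async move {
            // The connection pool can contain our own ID, and we never want to
            // trigger a sync with ourselves, so bail out early in that case.
            if peer == me {
                return;
            }
            // ... send the sync request to `peer` and await its response ...
        });
    }
}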

jakmeier added a commit to jakmeier/mpc that referenced this pull request Apr 10, 2025
@jakmeier requested a review from ChaoticTempest on April 10, 2025 20:23
ChaoticTempest previously approved these changes Apr 10, 2025

Contributor

@ChaoticTempest left a comment

So far, I like this approach where we're only going to generate protocols if it's stable via syncing. However, we'll probably need to extend this later: only signature generation relies on the stable set, while Triple and Presignature generation rely on the active set right now.

@jakmeier
Contributor Author

So far, I like this approach where we're only going to generate protocols if it's stable via syncing. However, we'll probably need to extend this later: only signature generation relies on the stable set, while Triple and Presignature generation rely on the active set right now.

Yes, good point. I noticed that too when I wrote this PR. I think it should be easy to add them after this is merged.

But anyway, as long as they don't respond with a list of what they still have of our owned presignatures / triples, this isn't actually providing much value. The main thing we can prevent by waiting for sync to finish is weird race conditions where sync removes data at just the wrong moment.

@jakmeier
Contributor Author

I see this warning many times while running the integration tests:

WARN mpc_node::protocol::message: some participants are not active even though mesh says they are not_active={Participant(0), Participant(1)}
WARN mpc_node::protocol::message: some participants are not active even though mesh says they are not_active={Participant(1), Participant(2)}
...

from here:

if !not_active.is_empty() {
    tracing::warn!(
        ?not_active,
        "some participants are not active even though mesh says they are"
    );
}

Should I worry about that? @ChaoticTempest

@ChaoticTempest
Contributor

shouldn't be too big of a concern. Those messages should only pop up during keygen. It's because some nodes haven't started their keygen yet, while one node has fully reached the point where it has started. The message is a bit of a misnomer because keygen uses all participants regardless of whether they're active or not.

ChaoticTempest previously approved these changes Apr 10, 2025
@jakmeier
Contributor Author

Hm, but there are some test failures. Not sure if I'll find time before next week to look into them.

2025-04-10T21:38:26.8552080Z failures:
2025-04-10T21:38:26.8552185Z     cases::test_batch_duplicate_signature
2025-04-10T21:38:26.8552268Z     cases::test_batch_random_signature
2025-04-10T21:38:26.8552346Z     cases::test_key_derivation
2025-04-10T21:38:26.8552428Z     cases::test_lake_congestion
2025-04-10T21:38:26.8552508Z     cases::test_multichain_reshare
2025-04-10T21:38:26.8552622Z     cases::test_multichain_reshare_with_lake_congestion
2025-04-10T21:38:26.8552712Z     cases::test_multichain_update_contract
2025-04-10T21:38:26.8552792Z     cases::test_signature_basic
2025-04-10T21:38:26.8552867Z     cases::test_signature_many
2025-04-10T21:38:26.8552947Z     cases::test_signature_offline_node
2025-04-10T21:38:26.8553056Z     cases::test_signature_offline_node_back_online
2025-04-10T21:38:26.8553130Z     cases::test_signature_rogue
2025-04-10T21:38:26.8553134Z 
2025-04-10T21:38:26.8553389Z test result: FAILED. 3 passed; 12 failed; 1 ignored; 0

@ChaoticTempest
Contributor

I'll take a look and try to fix it

@ChaoticTempest
Contributor

ahh, it's because of keygen again. Now we have Inactive, which does not add participants to the active set, while the Generating state needs them

@jakmeier
Copy link
Contributor Author

Thanks @ChaoticTempest for finding this! I changed it back to the old behaviour but one test still fails

2025-04-14T15:34:06.8275951Z failures:
2025-04-14T15:34:06.8276148Z     cases::test_multichain_reshare_with_lake_congestion
2025-04-14T15:34:06.8276299Z 
2025-04-14T15:34:06.8276618Z test result: FAILED. 14 passed; 1 failed; 1 ignored; 0

@ChaoticTempest
Contributor

that one is a bit flaky sometimes. Let me rerun the CI to see

@ChaoticTempest
Contributor

looks like this change affects resharing when there's latency. Still looking into why, but going from resharing to running makes the test fail because we don't have enough stable participants to generate a signature

@ChaoticTempest
Contributor

I think the issue with this is that we now have a race condition where some nodes observe a different stable set than other nodes, meaning some nodes are synced faster than others. And then, when they each have their own version of stable, they incorrectly start up a signature generation protocol and time out. This becomes more apparent when there's latency.

So I don't think we can add this in until we have consensus with Init messages then

@volovyks
Contributor

@ChaoticTempest that is what I was talking about at our last meeting. Only the proposer should decide who is stable and can participate in the protocol. All other protocol participants should blindly participate (or accept the protocol start message) without checking whether the proposer is a stable connection.

@ChaoticTempest
Contributor

Talked with @volovyks offline to clarify: what he was proposing is that we deterministically choose the proposer from the list of all participants instead of just the stable ones, and have a queue of proposers in case a signature generation fails. @jakmeier I remember you saying it's better to choose only the online ones.

The issue with the queue of proposers is that it creates a way for a participant to eventually become the proposer and to know ahead of time at which retry count they will be the proposer. This might open up an attack where a rogue participant can hijack the process and be the one to publish a signature, so I don't think the queue works well here.

Also, some more context on the issue I observed: after resharing, we immediately try to do generation on the next list of stable participants, but since sync happens in the background, the stable list will be incorrect until we have the full list. SignQueue also doesn't reorganize failed requests, so once we index a request and organize the list of participants for it, that list stays static. So we can either try to fix SignQueue to retry with a different list of participants each time, add a separate node state after resharing for synchronizing so that we don't start SignatureManager until we're in Running, or come up with a better consensus for the proposer/request-participant-list.
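To make the SignQueue point concrete, a hedged sketch of the first option (all names are hypothetical and only illustrate the idea of recomputing participants per retry; SignQueue itself looks different):

// Hypothetical sketch only; the point is to recompute the participant list from
// the current stable set on every retry instead of storing it once at index time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

struct SignRequest {
    id: [u8; 32],
    // Today (conceptually): fixed when the request is first organized and then
    // reused unchanged for every retry. Proposed fix: derive it per attempt.
    participants: Vec<Participant>,
}

/// Pick the participants for one attempt from whatever is stable right now.
fn participants_for_attempt(stable: &[Participant], threshold: usize) -> Option<Vec<Participant>> {
    if stable.len() < threshold {
        return None; // not enough stable nodes; retry later
    }
    Some(stable[..threshold].to_vec())
}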

@jakmeier
Contributor Author

Thanks @ChaoticTempest !

I think the issue with this is that we now have a race condition where some nodes observe a different stable set than other nodes, meaning some nodes are synced faster than others. And then, when they each have their own version of stable, they incorrectly start up a signature generation protocol and time out. This becomes more apparent when there's latency.

I believe this race condition is not new; it just now gets caught by the tests due to the additional latency.
That's also in line with this test already being flaky before my changes.

Overall, we must be able to deal with cases where a part of the network is unstable and different nodes have a different view of that.

So I don't think we can add this in until we have consensus with Init messages then

Sounds good to me.
We might also find more missing checks that should be done before starting a protocol.
This PR can wait on the sidelines until everything is in order.

Only the proposer should decide who is stable and can participate in the protocol. All other protocol participants should blindly participate (or accept the protocol start message) without checking whether the proposer is a stable connection.

@volovyks Yes, agreed. Whether the proposer is stable or not shouldn't change our decision to participate. We should primarily only check that the request is valid (indexed and final).
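To pin the idea down, a hedged sketch of that acceptance rule (every name here is made up for illustration; the real message and validation types differ):

// Hypothetical sketch; none of these types map one-to-one onto the real code.
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Participant(u32);

struct SignStart {
    request_id: [u8; 32],
    proposer: Participant,
    // Chosen solely by the proposer; we take it as-is.
    participants: Vec<Participant>,
}

/// Decide whether to join a signature generation started by someone else.
fn should_participate(msg: &SignStart, indexed_final_requests: &HashSet<[u8; 32]>) -> bool {
    // Only check that the request itself is valid (indexed and final).
    // Deliberately no check on whether the proposer is in our stable set.
    indexed_final_requests.contains(&msg.request_id)
}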

The issue with the queue of proposers is that it creates a way for a participant to eventually become the proposer and to know ahead of time at which retry count they will be the proposer. This might open up an attack where a rogue participant can hijack the process and be the one to publish a signature, so I don't think the queue works well here.

@ChaoticTempest I see what you mean. But the protocol is not byzantine fault tolerant at the moment, and I don't think we are ready to tackle that yet. Let's focus on something that's reliable under nice conditions first.
Also, forcing yourself to be the proposer shouldn't be a security issue. Worst case, the MPC network generates the same signature more than once.

But you are right, this might become a problem in the future. We will need a byzantine-tolerant proposer election at some point.

So we can either try to fix SignQueue to retry with a different list of participants each time, add a separate node state after resharing for synchronizing so that we don't start SignatureManager until we're in Running, or come up with a better consensus for the proposer/request-participant-list.

I think we should start with:

  1. Init message to ensure everyone uses the same list (a rough sketch of what such a message could carry follows below)
  2. Change the participant list on any retry (I believe that's what you do in feat: reorganize sign request participants on failure #321)
  3. Also change the proposer on a longer timeout (that could be based on a proposer queue)

Hopefully this makes things reliable enough that we can have different views on the network per node without it breaking down. Otherwise, we will have to search deeper.
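As a very rough illustration of step 1 above (entirely hypothetical; the actual message format and consensus rules would be decided in the follow-up work):

// Entirely hypothetical sketch of an Init-style start message: the proposer
// fixes the participant list up front so every node works with the same one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Participant(u32);

#[derive(Debug, Clone)]
struct SignInit {
    request_id: [u8; 32],
    proposer: Participant,
    // The exact participant set for this attempt, chosen by the proposer alone.
    participants: Vec<Participant>,
    // Bumped on every retry, so each attempt can use a fresh participant list
    // (and, after a longer timeout, a different proposer).
    attempt: u32,
}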

@jakmeier
Contributor Author
jakmeier commented Apr 17, 2025

@jakmeier I remember you saying it's better to choose only the online ones.

@ChaoticTempest Yeah but only when the proposer selects the list of participants. Everyone else must trust the list sent by the proposer and not interfere with it in any way.

@jakmeier force-pushed the check_sync_before_active branch from 440b34e to f0055d2 on April 24, 2025 10:39
feat: make only inactive => active be syncable
@jakmeier merged commit 929854b into sig-net:develop on May 22, 2025
1 of 3 checks passed