-
Notifications
You must be signed in to change notification settings - Fork 3
fix: remove dedupping the sign requests based on signId #355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes deduplication of sign requests based on signId, reflecting the updated requirement that subsystems no longer need to filter duplicate requests.
- Removed the contains method and its usage in the sign queue
- Eliminated the unique sign requests metric from the metrics module
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
chain-signatures/node/src/protocol/signature.rs | Removed deduplication logic from the sign queue |
chain-signatures/node/src/metrics.rs | Removed the metric counter for unique sign requests |
Comments suppressed due to low confidence (2)
chain-signatures/node/src/protocol/signature.rs:176
- [nitpick] Since deduplication is intentionally removed, consider adding a comment near this block to highlight that duplicate requests are now expected and handled elsewhere.
if self.contains(&indexed.id) {
chain-signatures/node/src/metrics.rs:496
- With the removal of deduplication logic, update the metric documentation accordingly and verify that any monitoring or alerting systems are adjusted for the removal of this counter.
pub(crate) static NUM_UNIQUE_SIGN_REQUESTS: LazyLock<CounterVec> = LazyLock::new(|| {
tracing::info!(sign_id = ?indexed.id, "skipping sign request: already in the sign queue"); | ||
continue; | ||
} | ||
crate::metrics::NUM_UNIQUE_SIGN_REQUESTS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to remove this metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly because I thought we don't care about this. Because the traffic we send for ethereum actually are half dupes, and we would want to respond to them even if they are dupes right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand your logic. This is signature.rs, it is used not only for Ethereum requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this file is for all sign requests. But this dedup logic causes problem when a duplicate request comes in before its previous request is completely finished.
For example, both node 1 and 2 have request A in their sign queue, node 1 has taken it off the sign queue and started working on it, but node 2 has not received message to join protocol so request A still in node2's sign queue. Now a dupe of request A comes in, and it will be removed as a duplicate for node2, but will be added to sign queue for node1. And when the first request A has finished generation of signature, you can imagine, node 1 will not have the 2nd request A, but node 2 will start request A.
Since we receive a lot of duplicate eth requests (likely due to retrying on our load test side), this inconsistency becomes more severe for eth requests.
@@ -183,14 +176,6 @@ impl SignQueue { | |||
other => other, | |||
} | |||
} { | |||
if self.contains(&indexed.id) { | |||
tracing::info!(sign_id = ?indexed.id, "skipping sign request: already in the sign queue"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check seams important. We are planning to have many different indexing solutions with different architecture.
Should we convert it to warn and keep it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason why I'm removing this logic is because we are currently skipping dupes and not generating signatures for them. But the traffic we receive for eth are half dupes. Also another issue is different nodes receive eth requests at slightly different times, and they often turn out to have different set of requests after dedupping, and that may cause some signatures don't have certain nodes joining protocol because that request is not in their sign queue.
This logic was originally added because for eth, the client will poll for the signId, so if multiple requests came in at almost same time, responding to one of them will have all clients receive the respond event for that signId. But this will make grafana tracking very hard, and it is a waste of our load testing logic as well because we won't respond to the dupe logic anyways.
I think in practice not skipping dupes at all is a cleaner solution and it achieves same experience for the user. So I'm removing this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are half dupes exactly?
Also another issue is different nodes receive eth requests at slightly different times, and they often turn out to have different set of requests after dedupping, and that may cause some signatures don't have certain nodes joining protocol because that request is not in their sign queue.
Duplicate requests should not have us end up with a sign queue with different requests from each other node so something is wrong here. It sounds like our sign queue is not well constrained or that eth indexer is not properly constructing reqeusts. Either ways, we should have a unit test for sign queue to track the behavior we're intending here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if you look at multichain_sign_requests_count_unique for eth, it is typically around half of multichain_sign_requests_count for eth. I think it's likely when we send eth traffic, we have a lot of retries, and we might end up sending the request with same parameters multiple times, and we will get duplicate requests that won't be handled by nodes.
Duplicate requests should not have us end up with a sign queue with different requests from each other node so something is wrong here. It sounds like our sign queue is not well constrained or that eth indexer is not properly constructing reqeusts.
I think it can happen when a duplicate request comes in before its previous request is completely finished. For example, both node 1 and 2 have request A in their sign queue, node 1 has taken it off the sign queue and started working on it, but node 2 has not received message to join protocol so request A still in node2's sign queue. Now a dupe of request A comes in, and it will be removed as a duplicate for node2, but will be added to sign queue for node1. And when the first request A has finished generation of signature, you can imagine, node 1 will not have the 2nd request A, but node 2 will start request A.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment t 8000 o others. Learn more.
I see, then I think we should move to the sign queue not removing the request until it's fully completed. We can add additional info like if the request is in progress or not, that way duplicates do not create additional entries in the queue. Additionally, we should also store the completed request sign id in an LruCache or something similar, such that we don't receive the same request after we're done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that makes sense. But it remains a problem that each node completes the signature protocol at different times, and the same problem will still happen.....
Shall we merge this first and then look into if a better fix is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean it's still a problem if it completes at different times? Wouldn't my suggestion with storing the completed request/sign_id be enough to resolve this? Or is this something else I'm missing.
I would err on the side of keeping the dedup logic since fulfilling and responding to duplicate requests can be expensive, especially on eth
We used to dedup sign requests based on signIds because our eth indexer used to have to go back to the last processed block and resubscribe. Since we do not have resubscribe logic any more, we do not need to dedup