Peers erroneously removed during playback #5135

Open
SpicyLemon opened this issue May 21, 2025 · 1 comment
Labels: bug, needs-triage

Bug Report

Setup

CometBFT version: v0.38.17

Have you tried the latest version: no (but the code is the same as main at 8583452)

ABCI app: https://github.com/provenance-io/provenance

Environment:

  • OS: Mac and Ubuntu
  • Install tools: ??
  • Others: ??

node command runtime flags: ??

Config

N/A

What happened?

Background: To start up a new node, our common practice is to copy the data directory of another node, then start the new node with that copy and let it catch back up. That data directory might be up to a day old, though, which means it has up to 20,000 blocks to replay before it's back in the game.

The problem we have is that our testnet (and sometimes mainnet too) has long runs of consecutive blocks without any Txs. There isn't much data in those blocks, but there's still a decent amount of overhead to process each empty block, which results in a low measured "data rate". It regularly gets "slow" enough that the peer gets stopped because "peer is not sending us data fast enough". The node then just sits there doing nothing for a minute or two before it decides to try again, quickly runs through a chunk of blocks, and then this bug gets triggered again.

It's so bad that it's sometimes impossible for a node to ever catch up; new blocks are being created faster than the node is getting the old ones (because it spends a lot of time doing nothing).
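
To see why, here's a rough back-of-the-envelope model in Go (a minimal sketch; the 90-second stall and the 4-second chain block time are illustrative assumptions, and the 5 blocks/s matches the replay pace noted with the logs below):

package main

import "fmt"

func main() {
    // Illustrative assumptions, not measured values.
    const (
        burstRate     = 5.0  // blocks/s while actually replaying (the pace seen in the logs below)
        stallSeconds  = 90.0 // "a minute or two" of idling after the peer gets dropped
        chainBlockSec = 4.0  // assumed block time of the live chain
    )

    // With bursts of B blocks separated by stalls, the node's average catch-up
    // rate is B / (B/burstRate + stallSeconds) blocks/s, while the chain keeps
    // producing 1/chainBlockSec blocks/s. Solving for the break-even B gives the
    // minimum burst length needed just to stop falling further behind.
    productionRate := 1.0 / chainBlockSec
    breakEven := productionRate * stallSeconds / (1.0 - productionRate/burstRate)

    fmt.Printf("chain production rate: %.2f blocks/s\n", productionRate)
    fmt.Printf("break-even burst size: %.0f blocks\n", breakEven)
    // With these numbers: ~24 blocks per burst; shorter bursts mean the node never catches up.
}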

Here's the code chunk where the error is produced:

if curRate != 0 && curRate < minRecvRate {
    err := errors.New("peer is not sending us data fast enough")
    pool.sendError(err, peer.id)
    pool.Logger.Error("SendTimeout", "peer", peer.id,
        "reason", err,
        "curRate", fmt.Sprintf("%d KB/s", curRate/1024),
        "minRate", fmt.Sprintf("%d KB/s", minRecvRate/1024))
    peer.didTimeout = true
}
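
For reference, the same condition as a standalone sketch (an illustrative reimplementation, not the actual pool code; the 127 KB/s and 128 KB/s figures are taken from the log lines further down):

package main

import "fmt"

// shouldDropPeer mirrors the condition above: the peer is flagged for removal
// whenever its measured receive rate is non-zero but below the configured
// minimum. Standalone reimplementation for illustration, not cometbft code.
func shouldDropPeer(curRate, minRecvRate int64) bool {
    return curRate != 0 && curRate < minRecvRate
}

func main() {
    const minRecvRate = 128 * 1024 // bytes/s; matches the "128 KB/s" minRate in the logs below

    // The logs below show the measured rate at 127 KB/s, one KB/s under the
    // minimum, even though blocks were being delivered and applied steadily.
    curRate := int64(127 * 1024)

    fmt.Printf("curRate=%d KB/s, minRate=%d KB/s, drop=%v\n",
        curRate/1024, minRecvRate/1024, shouldDropPeer(curRate, minRecvRate))
}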

What did you expect to happen?

The node continues to receive and process blocks from peers until it's caught up again.

How to reproduce it

Unfortunately, I don't have an easy way to reproduce this other than to grab an older quicksync (zipped data directory) and fire up a node.

Logs

Empty lines added for emphasis:

{"level":"info","module":"state","height":23960149,"num_txs_res":3,"num_val_updates":0,"block_app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:16-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960149,"app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:16-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960149,"block_app_hash":"695B48042721CC0CDB99085BAB1449A91276DADDE22EC12EB657EB7E0263EA75","time":"2025-05-21T12:23:16-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960150,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960150,"app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960150,"block_app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960151,"num_txs_res":3,"num_val_updates":0,"block_app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960151,"app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960151,"block_app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960152,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960152,"app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960152,"block_app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960153,"num_txs_res":0,"num_val_updates":0,"block_app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960153,"app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960153,"block_app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960154,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"1DD34DC156CDB5BFF477268599EBB1E057756023EBCABE7C794CA8FB1FD32DD8","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960154,"app_hash":"1DD34DC156CDB5BFF477268599EBB1E057756023EBCABE7C794CA8FB1FD32DD8","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960154,"block_app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"committed state"}

{"level":"error","module":"blocksync","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4","reason":"peer is not sending us data fast enough","curRate":"127 KB/s","minRate":"128 KB/s","time":"2025-05-21T12:23:17-06:00","message":"SendTimeout"}

{"level":"error","module":"p2p","peer":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","err":"error with peer fb4633e337486f9931b152f0ff254ccee6c0fce4: peer is not sending us data fast enough","time":"2025-05-21T12:23:17-06:00","message":"Stopping peer for error"}
{"level":"info","module":"p2p","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4@[2a02:c206:2076:7284::2]:26656","msg":"Stopping MConnection service","impl":"MConn{[2a02:c206:2076:7284::2]:26656}","time":"2025-05-21T12:23:17-06:00","message":"service stop"}
{"level":"error","module":"p2p","peer":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","err":"read tcp [2600:6c67:617f:4226:b44e:7ae4:e7bc:97f5]:60511->[2a02:c206:2076:7284::2]:26656: use of closed network connection","time":"2025-05-21T12:23:17-06:00","message":"Stopping peer for error"}
{"level":"info","module":"p2p","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4@[2a02:c206:2076:7284::2]:26656","msg":"Stopping Peer service","impl":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","time":"2025-05-21T12:23:17-06:00","message":"service stop"}
{"level":"error","module":"p2p","error":"already stopped","time":"2025-05-21T12:23:17-06:00","message":"error while stopping peer"}

It was receiving blocks at about 5 per second.
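
To put rough numbers on that: 127 KB/s at ~5 blocks per second works out to about 25 KB of block data per block on average, so a run of near-empty blocks at the same pace sits well under the 128 KB/s minimum no matter how promptly the peer delivers them. A small sketch of the arithmetic (the empty-block size is an assumption for illustration):

package main

import "fmt"

func main() {
    const (
        minRecvRate  = 128.0 // KB/s minimum, from the "minRate" in the log above
        blocksPerSec = 5.0   // observed block-processing pace
        observedRate = 127.0 // KB/s, the "curRate" from the log above
    )

    // 127 KB/s at ~5 blocks/s is roughly 25 KB of block data per block on average.
    avgKBPerBlock := observedRate / blocksPerSec
    fmt.Printf("average block size during this run: ~%.1f KB\n", avgKBPerBlock)

    // Over a stretch of near-empty blocks (size assumed here for illustration),
    // the same 5 blocks/s pace yields only a fraction of the required rate, so
    // delivering blocks promptly isn't enough to keep the peer above the threshold.
    const emptyBlockKB = 2.0 // assumed size of a block with no Txs
    emptyRunRate := emptyBlockKB * blocksPerSec
    fmt.Printf("rate over empty blocks: ~%.0f KB/s (minimum %.0f KB/s)\n", emptyRunRate, minRecvRate)
}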

dump_consensus_state output

N/A

Anything else we need to know

I've had others complain about this recently too, which is why I'm finally getting around to opening this issue. I've also seen folks on other chains complain about the same issue.

linear bot commented May 21, 2025
