Peers erroneously removed during playback #5135

Open
SpicyLemon opened this issue May 21, 2025 · 1 comment
Labels: bug, needs-triage

Bug Report

Setup

CometBFT version: v0.38.17

Have you tried the latest version: no (but the code is the same as main at 8583452)

ABCI app: https://github.com/provenance-io/provenance

Environment:

  • OS: Mac and Ubuntu
  • Install tools: ??
  • Others: ??

node command runtime flags: ??

Config

N/A

What happened?

Background: To start up a new node, our common practice is to copy the data directory of another node, then start the new node with that copy and let it catch back up. That data directory might be up to a day old, though, which means it has up to 20,000 blocks to replay before it's back in the game.

The problem we have is that our testnet (and sometimes mainnet too) has long runs of consecutive blocks without any Txs. There isn't much data in those blocks, but there's still a decent amount of overhead to process each empty block, which results in a low measured "data rate". It regularly gets "slow" enough that the peer gets stopped because "peer is not sending us data fast enough". The node then just sits there doing nothing for a minute or two before it decides to try again, quickly runs through a chunk of blocks, and then this bug gets triggered again.

It's so bad that it's sometimes impossible for a node to ever catch up; new blocks are being created faster than the node is getting the old ones (because it spends a lot of time doing nothing).
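
To see why, here's a rough back-of-the-envelope model in Go (a minimal sketch; the 90-second stall and the 4-second chain block time are illustrative assumptions, and the 5 blocks/s matches the replay pace noted with the logs below):

package main

import "fmt"

func main() {
    // Illustrative assumptions, not measured values.
    const (
        burstRate     = 5.0  // blocks/s while actually replaying (the pace seen in the logs below)
        stallSeconds  = 90.0 // "a minute or two" of idling after the peer gets dropped
        chainBlockSec = 4.0  // assumed block time of the live chain
    )

    // With bursts of B blocks separated by stalls, the node's average catch-up
    // rate is B / (B/burstRate + stallSeconds) blocks/s, while the chain keeps
    // producing 1/chainBlockSec blocks/s. Solving for the break-even B gives the
    // minimum burst length needed just to stop falling further behind.
    productionRate := 1.0 / chainBlockSec
    breakEven := productionRate * stallSeconds / (1.0 - productionRate/burstRate)

    fmt.Printf("chain production rate: %.2f blocks/s\n", productionRate)
    fmt.Printf("break-even burst size: %.0f blocks\n", breakEven)
    // With these numbers: ~24 blocks per burst; shorter bursts mean the node never catches up.
}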

Here's the code chunk where the error is produced:

if curRate != 0 && curRate < minRecvRate {
    err := errors.New("peer is not sending us data fast enough")
    pool.sendError(err, peer.id)
    pool.Logger.Error("SendTimeout", "peer", peer.id,
        "reason", err,
        "curRate", fmt.Sprintf("%d KB/s", curRate/1024),
        "minRate", fmt.Sprintf("%d KB/s", minRecvRate/1024))
    peer.didTimeout = true
}
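
For reference, the same condition as a standalone sketch (an illustrative reimplementation, not the actual pool code; the 127 KB/s and 128 KB/s figures are taken from the log lines further down):

package main

import "fmt"

// shouldDropPeer mirrors the condition above: the peer is flagged for removal
// whenever its measured receive rate is non-zero but below the configured
// minimum. Standalone reimplementation for illustration, not cometbft code.
func shouldDropPeer(curRate, minRecvRate int64) bool {
    return curRate != 0 && curRate < minRecvRate
}

func main() {
    const minRecvRate = 128 * 1024 // bytes/s; matches the "128 KB/s" minRate in the logs below

    // The logs below show the measured rate at 127 KB/s, one KB/s under the
    // minimum, even though blocks were being delivered and applied steadily.
    curRate := int64(127 * 1024)

    fmt.Printf("curRate=%d KB/s, minRate=%d KB/s, drop=%v\n",
        curRate/1024, minRecvRate/1024, shouldDropPeer(curRate, minRecvRate))
}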

What did you expect to happen?

The node continues to receive and process blocks from peers until it's caught up again.

How to reproduce it

Unfortunately, I don't have an easy way to reproduce this other than to grab an older quicksync (zipped data directory) and fire up a node.

Logs

Empty lines added for emphasis:

{"level":"info","module":"state","height":23960149,"num_txs_res":3,"num_val_updates":0,"block_app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:16-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960149,"app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:16-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960149,"block_app_hash":"695B48042721CC0CDB99085BAB1449A91276DADDE22EC12EB657EB7E0263EA75","time":"2025-05-21T12:23:16-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960150,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960150,"app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960150,"block_app_hash":"043316A328C77FBD6B88FBE41E298EC1CA39A45FE18E8D3743B23E3E6A8B55A8","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960151,"num_txs_res":3,"num_val_updates":0,"block_app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960151,"app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960151,"block_app_hash":"D09257B0133D105B156F1141E5FE6BCDBD9D39167568BA142D39CA1DB8E65501","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960152,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960152,"app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960152,"block_app_hash":"B210D19A93E5E9185D70B1C00C80681DAAAA91088B9C2BAF35624BA400599D4A","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960153,"num_txs_res":0,"num_val_updates":0,"block_app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960153,"app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960153,"block_app_hash":"C8CA497558F584758B6DB7FC220248FD5D574D5CFEFBBA0D78A2C17ED6F1785C","time":"2025-05-21T12:23:17-06:00","message":"committed state"}
{"level":"info","module":"state","height":23960154,"num_txs_res":1,"num_val_updates":0,"block_app_hash":"1DD34DC156CDB5BFF477268599EBB1E057756023EBCABE7C794CA8FB1FD32DD8","time":"2025-05-21T12:23:17-06:00","message":"finalized block"}
{"level":"info","module":"state","height":23960154,"app_hash":"1DD34DC156CDB5BFF477268599EBB1E057756023EBCABE7C794CA8FB1FD32DD8","time":"2025-05-21T12:23:17-06:00","message":"executed block"}
{"level":"info","module":"state","height":23960154,"block_app_hash":"F1C536CFAE7F993A8B7257AEA97026C3C5816650186D0E33DB432E185F155F47","time":"2025-05-21T12:23:17-06:00","message":"committed state"}

{"level":"error","module":"blocksync","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4","reason":"peer is not sending us data fast enough","curRate":"127 KB/s","minRate":"128 KB/s","time":"2025-05-21T12:23:17-06:00","message":"SendTimeout"}

{"level":"error","module":"p2p","peer":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","err":"error with peer fb4633e337486f9931b152f0ff254ccee6c0fce4: peer is not sending us data fast enough","time":"2025-05-21T12:23:17-06:00","message":"Stopping peer for error"}
{"level":"info","module":"p2p","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4@[2a02:c206:2076:7284::2]:26656","msg":"Stopping MConnection service","impl":"MConn{[2a02:c206:2076:7284::2]:26656}","time":"2025-05-21T12:23:17-06:00","message":"service stop"}
{"level":"error","module":"p2p","peer":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","err":"read tcp [2600:6c67:617f:4226:b44e:7ae4:e7bc:97f5]:60511->[2a02:c206:2076:7284::2]:26656: use of closed network connection","time":"2025-05-21T12:23:17-06:00","message":"Stopping peer for error"}
{"level":"info","module":"p2p","peer":"fb4633e337486f9931b152f0ff254ccee6c0fce4@[2a02:c206:2076:7284::2]:26656","msg":"Stopping Peer service","impl":"Peer{MConn{[2a02:c206:2076:7284::2]:26656} fb4633e337486f9931b152f0ff254ccee6c0fce4 out}","time":"2025-05-21T12:23:17-06:00","message":"service stop"}
{"level":"error","module":"p2p","error":"already stopped","time":"2025-05-21T12:23:17-06:00","message":"error while stopping peer"}

It was receiving blocks at about 5 per second.
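
To put rough numbers on that: 127 KB/s at ~5 blocks per second works out to about 25 KB of block data per block on average, so a run of near-empty blocks at the same pace sits well under the 128 KB/s minimum no matter how promptly the peer delivers them. A small sketch of the arithmetic (the empty-block size is an assumption for illustration):

package main

import "fmt"

func main() {
    const (
        minRecvRate  = 128.0 // KB/s minimum, from the "minRate" in the log above
        blocksPerSec = 5.0   // observed block-processing pace
        observedRate = 127.0 // KB/s, the "curRate" from the log above
    )

    // 127 KB/s at ~5 blocks/s is roughly 25 KB of block data per block on average.
    avgKBPerBlock := observedRate / blocksPerSec
    fmt.Printf("average block size during this run: ~%.1f KB\n", avgKBPerBlock)

    // Over a stretch of near-empty blocks (size assumed here for illustration),
    // the same 5 blocks/s pace yields only a fraction of the required rate, so
    // delivering blocks promptly isn't enough to keep the peer above the threshold.
    const emptyBlockKB = 2.0 // assumed size of a block with no Txs
    emptyRunRate := emptyBlockKB * blocksPerSec
    fmt.Printf("rate over empty blocks: ~%.0f KB/s (minimum %.0f KB/s)\n", emptyRunRate, minRecvRate)
}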

dump_consensus_state output

N/A

Anything else we need to know

I've had others complain about this recently too, which is why I'm finally getting around to opening this issue. I've also seen folks on other chains complain about the same issue.

linear bot commented May 21, 2025
