Description
Bug Report
Setup
CometBFT version v0.37.0
Have you tried the latest version: no
ABCI app (name for built-in, URL for self-written if it's publicly available): Cosmos SDK simapp
Environment:
- OS (e.g. from /etc/os-release):
go test
in macOS - Install tools:
- Others:
node command runtime flags:
Config
Default config.
What happened?
Using the same setup as #646 -- a test that instantiates multiple *node.Node
instances, starts them, checks something, then calls Stop()
and Wait()
on each Node -- and locally applying the patch mentioned in the issue to eliminate that test failure, the test almost always passes. On very rare occasions, there is a failure due to a log occurring after the test has ended. The stack looks like this:
panic: Log in goroutine after TestCometStarter_PortContention pex_reactor.go:458: <nil> INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10 rootmodule=comet_node-0
goroutine 1969 [running]:
testing.(*common).logDepth(0xc0015d4ea0, {0xc0129c1a40, 0xd9}, 0x3)
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1003 +0x4b4
testing.(*common).log(...)
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:985
testing.(*common).Logf(0xc0015d4ea0, {0x1030b4ff5, 0xb}, {0xc004c5c600, 0x4, 0x4})
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1036 +0x84
github.com/rs/zerolog.TestWriter.Write({{0x131279e18?, 0xc0015d4ea0?}, 0xc00b130350?}, {0xc003ae5040, 0xb7, 0x1a0})
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/writer.go:140 +0x358
bytes.(*Buffer).WriteTo(0xc00cf66c30, {0x10509a820, 0xc0063cc3f0})
/opt/homebrew/Cellar/go/1.20.1/libexec/src/bytes/buffer.go:252 +0xe4
github.com/rs/zerolog.ConsoleWriter.Write({{0x10509a820, 0xc0063cc3f0}, 0x0, {0x1030ac8a7, 0x6}, {0xc0063c2280, 0x4, 0x4}, {0x0, 0x0, ...}, ...}, ...)
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/console.go:145 +0x6d4
github.com/rs/zerolog.levelWriterAdapter.WriteLevel(...)
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/writer.go:25
github.com/rs/zerolog.(*Event).write(0xc0124a2d20)
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:78 +0x244
github.com/rs/zerolog.(*Event).msg(0xc0124a2d20, {0x1030b6dde, 0xc})
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:149 +0x30c
github.com/rs/zerolog.(*Event).Msg(...)
/Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:108
cosmossdk.io/log.zeroLogWrapper.Info({0xc00da7dc48?}, {0x1030b6dde?, 0xc?}, {0xc00b130280?, 0x8?, 0x8?})
/Users/hh/go/pkg/mod/cosmossdk.io/log@v1.0.0/logger.go:85 +0xf0
github.com/cometbft/cometbft/p2p/pex.(*Reactor).ensurePeers(0xc0029d9ec0)
/Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:458 +0x37c
github.com/cometbft/cometbft/p2p/pex.(*Reactor).ensurePeersRoutine(0xc0029d9ec0)
/Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:433 +0xd8
created by github.com/cometbft/cometbft/p2p/pex.(*Reactor).OnStart
/Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:165 +0x228
[originating from goroutine 2019]:
github.com/cometbft/cometbft/p2p/pex.(*Reactor).OnStart(...)
/Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:167 +0x228
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
/Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cometbft/cometbft/p2p.(*Switch).OnStart(...)
/Users/hh/go/src/github.com/cometbft/cometbft/p2p/switch.go:235 +0xb4
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
/Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cometbft/cometbft/node.(*Node).OnStart(...)
/Users/hh/go/src/github.com/cometbft/cometbft/node/node.go:979 +0x54c
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
/Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cosmos/cosmos-sdk/testutil/testnet.(*CometStarter).Start(...)
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/cometstarter.go:133 +0x2b0
github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork.func1(...)
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:42 +0xa0
created by github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:38 +0xb4
[originating from goroutine 2018]:
github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork(...)
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:36 +0xb4
cosmossdk.io/simapp/internal/testnet_test.TestCometStarter_PortContention.func1(...)
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/simapp/internal/testnet/cometstarter_test.go:93 +0x1ec
testing.tRunner(...)
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1579 +0x184
created by testing.(*T).Run
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1629 +0x5b8
[originating from goroutine 1985]:
testing.(*T).Run(...)
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1630 +0x5b8
cosmossdk.io/simapp/internal/testnet_test.TestCometStarter_PortContention(...)
/Users/hh/go/src/github.com/cosmos/cosmos-sdk/simapp/internal/testnet/cometstarter_test.go:92 +0x430
testing.tRunner(...)
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1579 +0x184
created by testing.(*T).Run
/opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1629 +0x5b8
FAIL cosmossdk.io/simapp/internal/testnet 15.743s
FAIL
Upon inspection, it looks there is a racy condition in the select
in (*pex.Reactor).ensurePeersRoutine
:
cometbft/p2p/pex/pex_reactor.go
Lines 418 to 446 in 365b0a7
If the ticker fires immediately before the quit channel is closed, then (*pex.Reactor).Stop()
may return immediately. The PEX reactor does not offer a way to wait for the ensurePeersRoutine
to complete, so despite calling (*node.Node).Wait()
, that goroutine may, and in rare occasions does, write to the reactor's logger after (*node.Node).Wait()
returns, thereby causing a test failure.
What did you expect to happen?
How to reproduce it
$ git clone -b mr/testnet-node-stop-wait --depth=1 https://github.com/mark-rushakoff/cosmos-sdk
$ cd cosmos-sdk/simapp
$ # Manually pply patch from bottom of opening comment in #646
$ while GORACE=halt_on_error=1 GODEBUG=tracebackancestors=3 go test ./internal/testnet/ -run=Contention -count=10 -failfast -race -v; do date; done
$ # Wait for a few minutes for test failure to occur
Logs
n/a
dump_consensus_state
output
n/a
Anything else we need to know
The suggested fix in #646 may be better implemented as a Wait()
method on the PEX reactor; but at this point, it might be best to make Wait()
a method on the base service interface so that any other reactor that runs background goroutines can offer a way to wait for them to complete.