8000 PEX reactor does not have a `Wait()` method, lacks way to wait for goroutine started in `OnStart()` · Issue #652 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
PEX reactor does not have a Wait() method, lacks way to wait for goroutine started in OnStart() #652
Closed
@mark-rushakoff

Description

@mark-rushakoff

Bug Report

Setup

CometBFT version v0.37.0

Have you tried the latest version: no

ABCI app (name for built-in, URL for self-written if it's publicly available): Cosmos SDK simapp

Environment:

  • OS (e.g. from /etc/os-release): go test in macOS
  • Install tools:
  • Others:

node command runtime flags:

Config

Default config.

What happened?

Using the same setup as #646 -- a test that instantiates multiple *node.Node instances, starts them, checks something, then calls Stop() and Wait() on each Node -- and locally applying the patch mentioned in the issue to eliminate that test failure, the test almost always passes. On very rare occasions, there is a failure due to a log occurring after the test has ended. The stack looks like this:

panic: Log in goroutine after TestCometStarter_PortContention pex_reactor.go:458: <nil> INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10 rootmodule=comet_node-0

goroutine 1969 [running]:
testing.(*common).logDepth(0xc0015d4ea0, {0xc0129c1a40, 0xd9}, 0x3)
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1003 +0x4b4
testing.(*common).log(...)
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:985
testing.(*common).Logf(0xc0015d4ea0, {0x1030b4ff5, 0xb}, {0xc004c5c600, 0x4, 0x4})
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1036 +0x84
github.com/rs/zerolog.TestWriter.Write({{0x131279e18?, 0xc0015d4ea0?}, 0xc00b130350?}, {0xc003ae5040, 0xb7, 0x1a0})
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/writer.go:140 +0x358
bytes.(*Buffer).WriteTo(0xc00cf66c30, {0x10509a820, 0xc0063cc3f0})
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/bytes/buffer.go:252 +0xe4
github.com/rs/zerolog.ConsoleWriter.Write({{0x10509a820, 0xc0063cc3f0}, 0x0, {0x1030ac8a7, 0x6}, {0xc0063c2280, 0x4, 0x4}, {0x0, 0x0, ...}, ...}, ...)
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/console.go:145 +0x6d4
github.com/rs/zerolog.levelWriterAdapter.WriteLevel(...)
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/writer.go:25
github.com/rs/zerolog.(*Event).write(0xc0124a2d20)
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:78 +0x244
github.com/rs/zerolog.(*Event).msg(0xc0124a2d20, {0x1030b6dde, 0xc})
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:149 +0x30c
github.com/rs/zerolog.(*Event).Msg(...)
        /Users/hh/go/pkg/mod/github.com/rs/zerolog@v1.29.0/event.go:108
cosmossdk.io/log.zeroLogWrapper.Info({0xc00da7dc48?}, {0x1030b6dde?, 0xc?}, {0xc00b130280?, 0x8?, 0x8?})
        /Users/hh/go/pkg/mod/cosmossdk.io/log@v1.0.0/logger.go:85 +0xf0
github.com/cometbft/cometbft/p2p/pex.(*Reactor).ensurePeers(0xc0029d9ec0)
        /Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:458 +0x37c
github.com/cometbft/cometbft/p2p/pex.(*Reactor).ensurePeersRoutine(0xc0029d9ec0)
        /Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:433 +0xd8
created by github.com/cometbft/cometbft/p2p/pex.(*Reactor).OnStart
        /Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:165 +0x228
[originating from goroutine 2019]:
github.com/cometbft/cometbft/p2p/pex.(*Reactor).OnStart(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/p2p/pex/pex_reactor.go:167 +0x228
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cometbft/cometbft/p2p.(*Switch).OnStart(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/p2p/switch.go:235 +0xb4
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cometbft/cometbft/node.(*Node).OnStart(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/node/node.go:979 +0x54c
github.com/cometbft/cometbft/libs/service.(*BaseService).Start(...)
        /Users/hh/go/src/github.com/cometbft/cometbft/libs/service/service.go:145 +0x4bc
github.com/cosmos/cosmos-sdk/testutil/testnet.(*CometStarter).Start(...)
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/cometstarter.go:133 +0x2b0
github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork.func1(...)
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:42 +0xa0
created by github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:38 +0xb4
[originating from goroutine 2018]:
github.com/cosmos/cosmos-sdk/testutil/testnet.NewNetwork(...)
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/testutil/testnet/network.go:36 +0xb4
cosmossdk.io/simapp/internal/testnet_test.TestCometStarter_PortContention.func1(...)
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/simapp/internal/testnet/cometstarter_test.go:93 +0x1ec
testing.tRunner(...)
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1579 +0x184
created by testing.(*T).Run
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1629 +0x5b8
[originating from goroutine 1985]:
testing.(*T).Run(...)
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1630 +0x5b8
cosmossdk.io/simapp/internal/testnet_test.TestCometStarter_PortContention(...)
        /Users/hh/go/src/github.com/cosmos/cosmos-sdk/simapp/internal/testnet/cometstarter_test.go:92 +0x430
testing.tRunner(...)
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1579 +0x184
created by testing.(*T).Run
        /opt/homebrew/Cellar/go/1.20.1/libexec/src/testing/testing.go:1629 +0x5b8
FAIL    cosmossdk.io/simapp/internal/testnet    15.743s
FAIL

Upon inspection, it looks there is a racy condition in the select in (*pex.Reactor).ensurePeersRoutine:

func (r *Reactor) ensurePeersRoutine() {
var (
seed = cmtrand.NewRand()
jitter = seed.Int63n(r.ensurePeersPeriod.Nanoseconds())
)
// Randomize first round of communication to avoid thundering herd.
// If no peers are present directly start connecting so we guarantee swift
// setup with the help of configured seeds.
if r.nodeHasSomePeersOrDialingAny() {
time.Sleep(time.Duration(jitter))
}
// fire once immediately.
// ensures we dial the seeds right away if the book is empty
r.ensurePeers()
// fire periodically
ticker := time.NewTicker(r.ensurePeersPeriod)
for {
select {
case <-ticker.C:
r.ensurePeers()
case <-r.Quit():
ticker.Stop()
return
}
}
}

If the ticker fires immediately before the quit channel is closed, then (*pex.Reactor).Stop() may return immediately. The PEX reactor does not offer a way to wait for the ensurePeersRoutine to complete, so despite calling (*node.Node).Wait(), that goroutine may, and in rare occasions does, write to the reactor's logger after (*node.Node).Wait() returns, thereby causing a test failure.

What did you expect to happen?

How to reproduce it

$ git clone -b mr/testnet-node-stop-wait --depth=1 https://github.com/mark-rushakoff/cosmos-sdk
$ cd cosmos-sdk/simapp
$ # Manually pply patch from bottom of opening comment in #646
$ while GORACE=halt_on_error=1 GODEBUG=tracebackancestors=3 go test ./internal/testnet/ -run=Contention -count=10 -failfast -race -v; do date; done
$ # Wait for a few minutes for test failure to occur

Logs

n/a

dump_consensus_state output

n/a

Anything else we need to know

The suggested fix in #646 may be better implemented as a Wait() method on the PEX reactor; but at this point, it might be best to make Wait() a method on the base service interface so that any other reactor that runs background goroutines can offer a way to wait for them to complete.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingp2p

Type

No type

Projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions

    0