8000 `(*pex.addrBook).Wait()` never called, occasionally continues writing after `(*node.Node).Wait()` returns · Issue #646 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

(*pex.addrBook).Wait() never called, occasionally continues writing after (*node.Node).Wait() returns #646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mark-rushakoff opened this issue Apr 5, 2023 · 3 comments · Fixed by #2010
Assignees
Labels
bug Something isn't working p2p
Milestone

Comments

@mark-rushakoff
Copy link
Contributor

Bug Report

Setup

CometBFT version (use cometbft version or git rev-parse --verify HEAD if installed from source): v0.37.0

Have you tried the latest version: no

ABCI app (name for built-in, URL for self-written if it's publicly available): simapp in Cosmos SDK

Environment:

  • OS (e.g. from /etc/os-release): go test inside macOS
  • Install tools:
  • Others:

node command runtime flags:

Config

Default config.

What happened?

In the Cosmos SDK, I am working on a set of tests that run a real CometBFT *node.Node in-process along with a simapp instance.

The tests defer node.Stop(); node.Wait(). Usually they pass.

Occasionally, there are failures like: testing.go:1206: TempDir RemoveAll cleanup: unlinkat /var/folders/.../TestCometStarter_PortContentionattempt_0278742771/001/config: directory not empty. If I switch away from using t.TempDir so that I can inspect the config directory which isn't empty, it always contains addrbook.json.

If I print debug.Stack() from inside (*pex.addrBook).saveToFile, I see that the final call writing out addrbook.jsonis the finala.saveToFilecall in(*pex.addrBook).saveRoutine`:

func (a *addrBook) saveRoutine() {
defer a.wg.Done()
saveFileTicker := time.NewTicker(dumpAddressInterval)
out:
for {
select {
case <-saveFileTicker.C:
a.saveToFile(a.filePath)
case <-a.Quit():
break out
}
}
saveFileTicker.Stop()
a.saveToFile(a.filePath)
}

At the top of that function you will see a defer a.wg.Done() call. If we call (*pex.addrBook).Wait() (which simply calls a.wg.Wait()), then we would block until a.SaveRoutine returned.

However... nothing in the codebase calls (*pex.addrBook).Wait(). If you comment out the function, go test -short ./... passes without any compilation errors.

What did you expect to happen?

How to reproduce it

$ git clone -b mr/testnet-node-stop-wait --depth=1 https://github.com/mark-rushakoff/cosmos-sdk
$ cd cosmos-sdk/simapp
$ while GORACE=halt_on_error=1 GODEBUG=tracebackancestors=3 go test ./internal/testnet/ -run=Contention -count=10 -failfast -race -v; do date; done

There are usually several successful runs before it fails.

Logs

n/a

dump_consensus_state output

n/a

Anything else we need to know

The simplest solution looks like it would be:

  1. Update the pex.AddrBook interface to include a Wait() method.
  2. Update (*pex.addrBook).Wait() to call a.BaseService.Wait(); a.wg.Wait()`.
  3. Add a new Wait() method to *node.Node which calls n.BaseService.Wait(); n.addrBook.Wait()

Patching CometBFT v0.37.0 with those changes appears to fully prevent the non-empty temp directory test failure that I could previously reproduce.

diff --git a/node/node.go b/node/node.go
index 4e5238d1d..be6af893f 100644
--- a/node/node.go
+++ b/node/node.go
@@ -1313,6 +1313,11 @@ func (n *Node) IsListening() bool {
        return n.isListening
 }

+func (n *Node) Wait() {
+       n.BaseService.Wait()
+       n.addrBook.Wait()
+}
+
 // NodeInfo returns the Node's Info from the Switch.
 func (n *Node) NodeInfo() p2p.NodeInfo {
        return n.nodeInfo
diff --git a/p2p/pex/addrbook.go b/p2p/pex/addrbook.go
index 857a6a063..86933616b 100644
--- a/p2p/pex/addrbook.go
+++ b/p2p/pex/addrbook.go
@@ -35,6 +35,8 @@ const (
 // peers to dial.
 // TODO: break this up?
 type AddrBook interface {
+       Wait()
+
        service.Service

        // Add our own addresses so we don't later add ourselves
@@ -171,6 +173,7 @@ func (a *addrBook) OnStop() {

 func (a *addrBook) Wait() {
        a.wg.Wait()
+       a.B
8000
aseService.Wait()
 }

 func (a *addrBook) FilePath() string {
@mark-rushakoff mark-rushakoff added bug Something isn't working needs-triage This issue/PR has not yet been triaged by the team. labels Apr 5, 2023
@cason
Copy link
Contributor
cason commented Apr 5, 2023

Legit.

For the solution I would propose something different. I think the change, which is needed, has to be made in the PEX reactor, which in fact controls the address book. It is up to the PEX reactor, to stop the address book in a proper way, waiting for the persistence routine to stop.

More specifically and for context, the address book is created by the Node constructor and is passed to the p2.Switch (node.createAddrBookAndSetOnSwitch method), then passed to the PEX reactor (node.createPEXReactorAndAddToSwitch). The PEX reactor is a p2p.Reactor as any other, started and stopped by the p2p.Switch. The PEX reactor startup method p2p.pex.Reactor.OnStart() starts the address book. The PEX reactor shutdown method p2p.pex.Reactor.OnStop() stops the address book. The address book implements, in the same way as the reactors, service.Service by extending service.BaseService. So it contains the same OnStart() and OnStop() methods.

@cason cason removed the needs-triage This issue/PR has not yet been triaged by the team. label Apr 5, 2023
@mark-rushakoff
Copy link
Contributor Author

Thanks @cason, I filed #652 at the same time that you posted this comment. Both this issue and that one are likely better fixed in one change to the PEX reactor like you've pointed out.

@cason cason added the p2p label Apr 5, 2023
@cason
Copy link
Contributor
cason commented Apr 5, 2023

This bug is very likely to be observed on main and v0.34.x branches as well, since the changes on the affected files of the p2p package are minor.

melekes added a commit that referenced this issue Jan 10, 2024
during addrBook#Stop
Closes #646
@adizere adizere added this to the 2024-Q1 milestone Jan 10, 2024
@adizere adizere added this to CometBFT Jan 11, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT Jan 11, 2024
@adizere adizere moved this from Todo to Ready for Review in CometBFT Jan 11, 2024
github-merge-queue bot pushed a commit that referenced this issue Feb 6, 2024
Closes #646,
#652

<!--

Please add a reference to the issue that this PR addresses and indicate
which
files are most critical to review. If it fully addresses a particular
issue,
please include "Closes #XXX" (where "XXX" is the issue number).

If this PR is non-trivial/large/complex, please ensure that you have
either
created an issue that the team's had a chance to respond to, or had some
discussion with the team prior to submitting substantial pull requests.
The team
can be reached via GitHub Discussions or the Cosmos Network Discord
server in
the #cometbft channel. GitHub Discussions is preferred over Discord as
it
allows us to keep track of conversations topically.
https://github.com/cometbft/cometbft/discussions

If the work in this PR is not aligned with the team's current
priorities, please
be advised that it may take some time before it is merged - especially
if it has
not yet been discussed with the team.

See the project board for the team's current priorities:
https://github.com/orgs/cometbft/projects/1

-->

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in CometBFT Feb 6, 2024
mergify bot pushed a commit that referenced this issue Feb 6, 2024
Closes #646,
#652

<!--

Please add a reference to the issue that this PR addresses and indicate
which
files are most critical to review. If it fully addresses a particular
issue,
please include "Closes #XXX" (where "XXX" is the issue number).

If this PR is non-trivial/large/complex, please ensure that you have
either
created an issue that the team's had a chance to respond to, or had some
discussion with the team prior to submitting substantial pull requests.
The team
can be reached via GitHub Discussions or the Cosmos Network Discord
server in
the #cometbft channel. GitHub Discussions is preferred over Discord as
it
allows us to keep track of conversations topically.
https://github.com/cometbft/cometbft/discussions

If the work in this PR is not aligned with the team's current
priorities, please
be advised that it may take some time before it is merged - especially
if it has
not yet been discussed with the team.

See the project board for the team's current priorities:
https://github.com/orgs/cometbft/projects/1

-->

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~

(cherry picked from commit 8a0a2f6)

# Conflicts:
#	p2p/pex/pex_reactor.go
melekes added a commit that referenced this issue Feb 8, 2024
fix(p2p/pex): gracefully shutdown < 8000 a title="fix(p2p/pex): gracefully shutdown `Reactor` (#2010) Closes https://github.com/cometbft/cometbft/issues/646, https://github.com/cometbft/cometbft/issues/652 <!-- Please add a reference to the issue that this PR addresses and indicate which files are most critical to review. If it fully addresses a particular issue, please include "Closes #XXX" (where "XXX" is the issue number). If this PR is non-trivial/large/complex, please ensure that you have either created an issue that the team's had a chance to respond to, or had some discussion with the team prior to submitting substantial pull requests. The team can be reached via GitHub Discussions or the Cosmos Network Discord server in the #cometbft channel. GitHub Discussions is preferred over Discord as it allows us to keep track of conversations topically. https://github.com/cometbft/cometbft/discussions If the work in this PR is not aligned with the team's current priorities, please be advised that it may take some time before it is merged - especially if it has not yet been discussed with the team. See the project board for the team's current priorities: https://github.com/orgs/cometbft/projects/1 --> --- #### PR checklist - [ ] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code comments~~" data-pjax="true" class="Link--secondary markdown-title" href="/cometbft/cometbft/commit/8f5552b96bd047cec8aa8f9ea9805472c1433e82">Reactor (#2010)
Closes #646,
#652

<!--

Please add a reference to the issue that this PR addresses and indicate
which
files are most critical to review. If it fully addresses a particular
issue,
please include "Closes #XXX" (where "XXX" is the issue number).

If this PR is non-trivial/large/complex, please ensure that you have
either
created an issue that the team's had a chance to respond to, or had some
discussion with the team prior to submitting substantial pull requests.
The team
can be reached via GitHub Discussions or the Cosmos Network Discord
server in
the #cometbft channel. GitHub Discussions is preferred over Discord as
it
allows us to keep track of conversations topically.
https://github.com/cometbft/cometbft/discussions

If the work in this PR is not aligned with the team's current
priorities, please
be advised that it may take some time before it is merged - especially
if it has
not yet been discussed with the team.

See the project board for the team's current priorities:
https://github.com/orgs/cometbft/projects/1

-->

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p2p
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants
0