8000 libn/networkdb: fix data race in GetTableByNetwork by corhere · Pull Request #49937 · moby/moby · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

libn/networkdb: fix data race in GetTableByNetwork #49937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 12, 2025

Conversation

corhere
Copy link
Contributor
@corhere corhere commented May 7, 2025

- What I did
- How I did it

The function was accessing the index map without holding the mutex, so it would race any mutation to the database indexes. Fetch the reference to the tree's root while holding a read lock. Since the radix tree is immutable, taking a reference to the root is equivalent to starting a read-only database transaction, providing a consistent view of the data at a snapshot in time, even as the live state is mutated concurrently.

Also optimize the WalkTable function by leveraging the immutability of the radix tree.

- How to verify it

go test -race

- Human readable description for the release notes

Fix an issue where `docker network inspect --verbose` could sometimes crash the daemon

- A picture of a cute animal (not mandatory but encouraged)

The function was accessing the index map without holding the mutex, so
it would race any mutation to the database indexes. Fetch the reference
to the tree's root while holding a read lock. Since the radix tree is
immutable, taking a reference to the root is equivalent to starting a
read-only database transaction, providing a consistent view of the data
at a snapshot in time, even as the live state is mutated concurrently.

Also optimize the WalkTable function by leveraging the immutability of
the radix tree.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@robmry
Copy link
Contributor
robmry commented May 7, 2025

I saw the modprobe-not-found failures in one rootless run yesterday too ...

=== FAIL: amd64.integration.network.bridge TestAllPortMappingsAreReturned (2.78s)
=== FAIL: amd64.integration.network.bridge TestLegacyLink (6.38s)
=== FAIL: amd64.integration.network.bridge TestRemoveLegacyLink (4.41s)

Not sure why they'd be flaky - modprobe not found seems like something that should happen always or never. So, it'll need looking at.)

8000 Copy link
Member
@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +594 to 596
params := strings.Split(string(path[1:]), "/")
nid := params[1]
key := params[2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but would this be something we could replace for strings.Cut, or do we expect more than one / (and only need the first two components here)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices start at 0. params is a 3-tuple; we are extracting the second and third fields. I guess we could do something like the following if we really wanted to.

nid, key, _ = strings.Cut(string(path[len(tname)+2]), "/")

Comment on lines +432 to +445
done := make(chan struct{})
defer close(done)
for _, db := range dbs {
go func(db *NetworkDB) {
for {
select {
case <-done:
return
default:
}
_ = db.GetTableByNetwork("test_table", "network1")
}
}(db)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely non-blocking; guess we could make this use a context to make it slight more to the point

// Shake out any data races.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
for _, db := range dbs {
	go func(db *NetworkDB) {
		for ctx.Err() == nil {
			_ = db.GetTableByNetwork("test_table", "network1")
		}
	}(db)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! I was just looking at that one earlier Today, but we need to wait for go1.24 (which should be possible now that 1.24.3 is released). I was looking for it to be used in our integration tests, but we have a setupTest that depends on the TestMain to set up a common context, which doesn't allow creating one (testing.M doesn't have the Context() 😂

Comment on lines +432 to +434
nDB.RLock()
root := nDB.indexes[byNetwork].Root()
nDB.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting; so we have a read-lock here, but there's no lock in 2 places I found that mutate this?

func (nDB *NetworkDB) createOrUpdateEntry(nid, tname, key string, v *entry) (okTable bool, okNetwork bool) {
nDB.indexes[byTable], _, okTable = nDB.indexes[byTable].Insert([]byte(fmt.Sprintf("/%s/%s/%s", tname, nid, key)), v)
nDB.indexes[byNetwork], _, okNetwork = nDB.indexes[byNetwork].Insert([]byte(fmt.Sprintf("/%s/%s/%s", nid, tname, key)), v)

func (nDB *NetworkDB) deleteEntry(nid, tname, key string) (okTable bool, okNetwork bool) {
nDB.indexes[byTable], _, okTable = nDB.indexes[byTable].Delete([]byte(fmt.Sprintf("/%s/%s/%s", tname, nid, key)))
nDB.indexes[byNetwork], _, okNetwork = nDB.indexes[byNetwork].Delete([]byte(fmt.Sprintf("/%s/%s/%s", nid, tname, key)))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ignore me; from #44501

The major changes compared to the old radix are when modifying (insert or delete) a tree, and those are pretty self-contained: we replace the entire immutable tree under a lock.

Looks like caller is acquiring a lock indeed. although there's some odd bits because it's doing the swap inside of a nDB.indexes[byNetwork].Root().WalkPrefix([]byte("/"+nid), 🤔 .. I recall now that there was some weird stuff in that area that originally caused some panics the "mutable" radix, well 🤷

@thaJeztah thaJeztah added this to the 28.2.0 milestone May 12, 2025
@thaJeztah thaJeztah merged commit e824fed into moby:master May 12, 2025
153 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

network inspect causes fatal error: concurrent map read and map write
3 participants
0