8000 fix: Concurrent map read/write in Tree methods by motoki317 · Pull Request #1238 · go-git/go-git · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

fix: Concurrent map read/write in Tree methods #1238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: previous-main
Choose a base branch
from

Conversation

motoki317
Copy link
@motoki317 motoki317 commented Dec 5, 2024

The "tree path cache" internal field in Tree struct is preventing Tree methods to be called from multiple goroutines concurrently - producing "concurrent map read and map write" panics.
This PR disables this internal cache and allows the same Tree object methods to be called concurrently.

IMHO, this type of cache is a bad practice in Go and it is better to utilize multiple goroutines to make processing things faster.
Of course, unless the performance gain by this cache was measured thoroughly and this cache definitely improves performance, but I don't think this is the case.

In my project usecase, disabling this cache resulted in a faster performance of my tool. (see commit.)

Thanks!

@motoki317
Copy link
Author

@pjbgf Hello! Excuse me for tagging you, but is there any chance this change could be reviewed?

@pjbgf
Copy link
Member
pjbgf commented Jan 5, 2025

@motoki317 thank you for your PR. Changes like this are better off targeting the v6-exp branch which will be our next major release.

Would you be able to provide a benchmark test showing the performance improvements?

@motoki317
Copy link
Author

@pjbgf Thanks for taking a look at this.

I don't think this change is a breaking one, but should I still target the v6 branch? I think it'd be better to import this into v5 (and maybe to v6 as well) if v6 is going to take a bit longer.
(In fact, I'm using my forked go-git by using replace in go.mod in my project. See PR description)

For benchmark, I could try adding one, but I'm not sure how to go about benchmarking this.
Ultimately, the performance gain (or loss) by this change is really dependent on the structure of real projects; a "normal" project with not many (around tens) files per directory and not that deep structure would, I think, benefit from this change. I'm not sure if any project/usecase would suffer from this change, but I don't think there's many of such projects and the performance impact should be unnoticeable.

@motoki317
Copy link
Author

At least in my usecase, this change resulted in at least a few times faster processing time (a few hundreds ms to a few seconds gain), in most of real projects (I'm mining a lot of OSS repositories), using parallel calls to these methods.

@pjbgf
Copy link
Member
pjbgf commented Jan 24, 2025

@motoki317 the v5 release is frozen, changes to it are now mostly bug fixes or bumps for dependencies. Changes such as this will need to go against main, for the v6 release.

At least in my usecase, this change resulted in at least a few times faster processing time (a few hundreds ms to a few seconds gain), in most of real projects (I'm mining a lot of OSS repositories), using parallel calls to these methods.

It is common for performance improvements to have side-effects, sometimes the slower processing times causes higher allocations, or memory usage, which leads to slower long-running applications due to GC pressure.

For example this commit made clones 15% slower, however it reduced the memory cost by 77%. By running Go benchmarks and pprof we can understand what are the trade-offs before making such decisions.

@motoki317 motoki317 changed the base branch from master to main January 24, 2025 10:02
@motoki317
Copy link
Author

@pjbgf Thank you for the response.

I see, I have updated the base branch to main if you say so.

As for the benchmark, I added a (rough-cut) benchmark code above, but I seem to be getting file: already closed or zlib: invalid header errors like this, if I try to call FindEntry() concurrently:

    repository_test.go:3622: error reading utils/merkletrie/change_test.go: read /tmp/BenchmarkTree_FindEntry_Parallel4239159960/001/objects/pack/pack-af8e45f7de79ee1ca5d7f555bf5868880425ba96.pack: file already closed
    repository_test.go:3622: error reading utils/merkletrie/difftree.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/format/packfile/types.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/format/commitgraph/v2/commitgraph.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/object.go: zlib: invalid header
    repository_test.go:3622: error reading utils/merkletrie/filesystem/node.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/protocol/packp/capability/capability.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/protocol/packp/ulreq_encode.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/protocol/packp/ulreq_test.go: zlib: invalid header
    repository_test.go:3622: error reading plumbing/format/config/common_test.go: zlib: invalid header

This doesn't occur in my tool's usecase, so I may be missing something here, or it may be just that go-git's implementation is not really thread-safe at many places other than this.

Nevertheless, when I ignore these errors, I obtained the following results. I run benchmarks on both main and in my branch, and both on github.com/go-git/go-git and github.com/torvalds/linux (as an example of a larger repository).

v6 main
$ go test -run=^$ -test.bench ^BenchmarkTree_FindEntry$ -benchmem
goos: linux
goarch: amd64
pkg: github.com/go-git/go-git/v5
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkTree_FindEntry-24            99          11549066 ns/op        27529918 B/op      46009 allocs/op
PASS
ok      github.com/go-git/go-git/v5     2.403s

v6 fix/race-in-tree
$ go test -run=^$ -test.bench ^BenchmarkTree_FindEntry -benchmem
goos: linux
goarch: amd64
pkg: github.com/go-git/go-git/v5
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkTree_FindEntry-24                    20          56532524 ns/op        146480136 B/op    264872 allocs/op
BenchmarkTree_FindEntry_Parallel-24           50          20483664 ns/op        128628483 B/op    192093 allocs/op
PASS
ok      github.com/go-git/go-git/v5     5.109s

v6 main (github.com/torvalds/linux)
$ go test -run=^$ -test.bench ^BenchmarkTree_FindEntry$ -benchmem
goos: linux
goarch: amd64
pkg: github.com/go-git/go-git/v5
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkTree_FindEntry-24             2         551947129 ns/op        1216659400 B/op  2608151 allocs/op
PASS
ok      github.com/go-git/go-git/v5     3.247s

v6 fix/race-in-tree (github.com/torvalds/linux)
$ go test -run=^$ -test.bench ^BenchmarkTree_FindEntry -benchmem
goos: linux
goarch: amd64
pkg: github.com/go-git/go-git/v5
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkTree_FindEntry-24                     1        13438525538 ns/op       36206744264 B/op        129991164 allocs/op
BenchmarkTree_FindEntry_Parallel-24            1        2665111342 ns/op        33130471120 B/op        111777776 allocs/op
PASS
ok      github.com/go-git/go-git/v5     17.653s

which is telling me that this change is making the code slower, at least for this benchmark code.

However, I have checked twice now that this change indeed improves performance in my usecase, by checking performance before and after my tool's commit here.
In a specific configuration, my tool's runtime improved from 16s to 14s.
I guess the performance improvement in my usecase comes from the fact that I need to "preload" all files sequentially due to the fact that go-git's methods are not thread-safe, so the "preloading" affects the tool's runtime. On the other hand, if go-git's methods become thread-safe (after applying this PR's patch), it no longer needs to be "preloaded" and I can call it from any goroutine anytime, so parallelism improves and the overall runtime improves.

So, actually I am not sure what to do now. I'll probably keep using this patched go-git in my tool, and I may improve the "preload" implementation so it doesn't become a bottleneck in my tool and report it back here (when I have bandwidth to do so).
Nevertheless, I think it's making it hard for go-git to be used in Golang codes if its methods are not thread-safe, like in my example, so IMO this patch is still worth considering. There's probably many other non-thread-safe code to be fixed if you're going to make methods thread-safe though.

@pjbgf
Copy link
Member
pjbgf commented Jan 24, 2025

@motoki317 thank you again for the follow-up on this.

go-git's implementation is not really thread-safe at many places

That is correct. There are thread-safety issues and the idea is to start finding and fixing them as part of v6. Ideally the performance improvements done over time, should off-set any potential performance cost we may encounter.

which is telling me that this change is making the code slower, at least for this benchmark code.

Unfortunately, the failures could actually mean that not all code is being executed, which invalidate the benchmark results. Let me try to take a look at this in the coming days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0