-
Notifications
You must be signed in to change notification settings - Fork 191
Intermittent CI failure: assertion tripped when run under Valgrind #730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Different but presuambly related failure here: https://github.com/luvit/luv/actions/runs/12801148295/job/35690138656?pr=746
This seems suspiciously like stack corruption/use-after-free/race condition or something like that. Impossible error codes, impossible |
Something else that's strange about this is the The expected output for the
|
To add to the above wrt impossible error codes: It's because libuv returns the wrong thing for errors. It's returning A result of errno -115 tells us nothing more than one of the pthread functions failed, we will not know why or which until libuv starts providing the correct error information. |
Was able to reproduce the failure locally with a patched libuv (will send a PR upstream, EDIT: libuv/libuv#4782). Here's the real error:
|
Something funny came to mind now that I just looked at the actual test case, since we're running this in Valgrind there is a serious chance that the thread finishes execution before we get around to setting its priority. We should probably be synchronizing the start of the thread with the rest of the test case so that the test is guaranteed to flow: create -> default priority -> set priority -> check priority in main -> check priority in thread -> exit thread -> join thread (probably by notifying the thread once after we set priority in main). And this is probably a general problem with all of the threads tests. |
Makes sense. Adding some debug printing has ruled out things getting corrupted before the failing
The threading/async stuff is the area of luv I have the least experience with, so some help in fixing these tests would be much appreciated. EDIT: Reproduction info if it's helpful:
(basically just what the CI does) Doesn't reproduce 100% of the time, more like 1/3 of the time. |
Well the simple solution would be create a semaphore with an initial value of zero, call However that brings in the difficulty of we have none of the libuv synchronization primitives bound, which would include If synchronization is a 8000 ctually something you'd like to consider I'd be willing to put together a pull request with all of the synchronization stuff from libuv but it may be opening a bag of worms we may not want. |
That also makes sense. That's exactly what the libuv If the thread API can't be properly tested without those synchronization primitive bindings, then it seems like it'd be worthwhile to have them bound. Could maybe start with a proof-of-concept with just I've mostly let @zhaozg deal with async/threading, though, so I'd like to get his opinion on this as well. |
I have a working proof of concept that fixes this race condition using a semaphore but the farther I got into implementing the more and more concerned I've gotten about the thread safety in luv. Will require further inspecting and I'll split that out into another issue. I couldn't reproduce the issue locally without forcing a sleep on the main thread to mimic the failure condition but with this patch to the test valgrind succeeds on
The next concern is that it appears setpriority is doing nothing, because |
When running that test locally:
|
@truemedian would you like to PR the |
Submitted a PR for it but it exposed some rather concerning underlying problems about how threads work in luv. In particular around how the internals of handles are designed around an extra layer of pointer indirection to allow them to be passed across threads despite the fact that libuv states that only uv_async_t has any thread safety guarantees and therefore every other handle type cannot be safely passed to any other thread. I'm in the process of going through and figuring out what exactly is broken and in the process might end up with just a large patch to improve/update/fix everything I find. |
Do you think it's worth merging the If it would be better to wait, we could just disable the flaky |
The semaphore bindings are useful if ever so slightly, and it's mostly self contained. I have no idea how long what I'm working on with take, but I foresee at least a week (probably more) and probably way more modifications than anyone is comfortable seeing. |
No rush, I'm satisfied that we at least know what's happening with the intermittent valgrind failures. Do you think the |
The interface shouldn't need to change, there really isn't much to bind there. I'm fairly confident that I can assert the interface will never need to change (unless I massively overlooked something) Additionally, I'm pretty sure everything I want to change is under the hood and should have no interface-breaking concerns. |
Split from #721
This assertion was tripped in a CI run of our tests here: https://github.com/luvit/luv/actions/runs/11649845591/job/32437850489
It occurred during the "getpriority, setpriority" test, and a re-run fixed it so it is an intermittent failure.
Our own tests tripping this assertion seems like bad news.
Same failure here and (presumably, logs have been cleared) here
Not sure why it's only getting tripped when run under Valgrind.
The text was updated successfully, but these errors were encountered: