8000 Intermittent CI failure: assertion tripped when run under Valgrind · Issue #730 · luvit/luv · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Intermittent CI failure: assertion tripped when run under Valgrind #730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
squeek502 opened this issue Nov 5, 2024 · 16 comments · Fixed by #774
Closed

Intermittent CI failure: assertion tripped when run under Valgrind #730

squeek502 opened this issue Nov 5, 2024 · 16 comments · Fixed by #774

Comments

@squeek502
Copy link
Member
squeek502 commented Nov 5, 2024

Split from #721


This assertion was tripped in a CI run of our tests here: https://github.com/luvit/luv/actions/runs/11649845591/job/32437850489

ok 126 thread - getaffinity, setaffinity
default priority	0
hello world from thread
priority in thread	0
lua: /home/runner/work/luv/luv/src/loop.c:98: luv_walk_cb: Assertion `data && data->ref < 0x1000000' failed.

It occurred during the "getpriority, setpriority" test, and a re-run fixed it so it is an intermittent failure.

Our own tests tripping this assertion seems like bad news.


Same failure here and (presumably, logs have been cleared) here

Not sure why it's only getting tripped when run under Valgrind.

@squeek502
Copy link
Member Author
squeek502 commented Jan 16, 2025

Different but presuambly related failure here: https://github.com/luvit/luv/actions/runs/12801148295/job/35690138656?pr=746

default priority	0
hello world from thread
priority in thread	0
  ./tests/test-thread.lua:180: Unknown system error -115: Unknown system error -115
  stack traceback:
  	[C]: in function 'assert'
  	./tests/test-thread.lua:180: in field 'fn'
  	./lib/tap.lua:59: in function <./lib/tap.lua:48>
  	[C]: in function 'xpcall'
  	./lib/tap.lua:48: in upvalue 'run'
  	./lib/tap.lua:146: in function 'lib/tap'
  	tests/run.lua:23: in main chunk
  	[C]: in ?
not ok 129 thread - getpriority, setpriority

This seems suspiciously like stack corruption/use-after-free/race condition or something like that. Impossible error codes, impossible data->ref values, etc.

@squeek502
Copy link
Member Author
squeek502 commented Jan 17, 2025

Something else that's strange about this is the hello world from thread output. That's the output from a completely different test (test avoid thread be released before it done), so I have no idea why it's showing up here.

The expected output for the getpriority, setpriority test is

default priority        0
priority after change   0
priority in thread      0
ok 9 getpriority, setpriority

@truemedian
Copy link
Member

To add to the above wrt impossible error codes: It's because libuv returns the wrong thing for errors. It's returning UV__ERR(errno) despite the fact that pthread functions do not return error information in errno. It should be returning UV__ERR(retval).

A result of errno -115 tells us nothing more than one of the pthread functions failed, we will not know why or which until libuv starts providing the correct error information.

@squeek502
Copy link
Member Author
squeek502 commented May 11, 2025

A result of errno -115 tells us nothing more than one of the pthread functions failed, we will not know why or which until libuv starts providing the correct error information.

Was able to reproduce the failure locally with a patched libuv (will send a PR upstream, EDIT: libuv/libuv#4782). Here's the real error:

  ./tests/test-thread.lua:180: ESRCH: no such process
  stack traceback:
        [C]: in function 'assert'
        ./tests/test-thread.lua:180: in field 'fn'
        ./lib/tap.lua:59: in function <./lib/tap.lua:48>
        [C]: in function 'xpcall'
        ./lib/tap.lua:48: in upvalue 'run'
        ./lib/tap.lua:146: in function 'lib/tap'
        tests/run.lua:23: in main chunk
        [C]: in ?

@truemedian
Copy link
Member
truemedian commented May 11, 2025

Something funny came to mind now that I just looked at the actual test case, since we're running this in Valgrind there is a serious chance that the thread finishes execution before we get around to setting its priority.

We should probably be synchronizing the start of the thread with the rest of the test case so that the test is guaranteed to flow: create -> default priority -> set priority -> check priority in main -> check priority in thread -> exit thread -> join thread (probably by notifying the thread once after we set priority in main).

And this is probably a general problem with all of the threads tests.

@squeek502
Copy link
Member Author
squeek502 commented May 11, 2025

Makes sense. Adding some debug printing has ruled out things getting corrupted before the failing setpriority call (this is just printing out the uv_thread_t to make sure it stays the same):

new: 9c22640
getprio: 9c22640
default priority        0
getprio: 9c22640
priority in thread      6
setprio: 9c22640
  ./tests/test-thread.lua:180: ESRCH: no such process

The threading/async stuff is the area of luv I have the least experience with, so some help in fixing these tests would be much appreciated.


EDIT: Reproduction info if it's helpful:

BUILD_TYPE=Debug WITH_LUA_ENGINE=Lua make
valgrind --suppressions=.ci/valgrind_mem.supp --error-exitcode=1 --leak-check=full --child-silent-after-fork=yes --keep-debuginfo=yes --track-origins=yes ./build/lua -e "collectgarbage('incremental', 0, 10000000000000)" tests/run.lua

(basically just what the CI does)

Doesn't reproduce 100% of the time, more like 1/3 of the time.

@truemedian
Copy link
Member
truemedian commented May 11, 2025

Well the simple solution would be create a semaphore with an initial value of zero, call wait in the thread (which would pause the thread because it's waiting for the semaphore to be >1) and then post (which increments the semaphore value and wakes any threads waiting on it) in main to wake the thread when we're ready.

However that brings in the difficulty of we have none of the libuv synchronization primitives bound, which would include uv_sem_t.

If synchronization is a 8000 ctually something you'd like to consider I'd be willing to put together a pull request with all of the synchronization stuff from libuv but it may be opening a bag of worms we may not want.

@squeek502
Copy link
Member Author

That also makes sense. That's exactly what the libuv test-thread-priority.c test case does.

If the thread API can't be properly tested without those synchronization primitive bindings, then it seems like it'd be worthwhile to have them bound. Could maybe start with a proof-of-concept with just uv_sem_t if that'd make things easier?

I've mostly let @zhaozg deal with async/threading, though, so I'd like to get his opinion on this as well.

@truemedian
Copy link
Member

I have a working proof of concept that fixes this race condition using a semaphore but the farther I got into implementing the more and more concerned I've gotten about the thread safety in luv. Will require further inspecting and I'll split that out into another issue.

truemedian@d8e1d9b

I couldn't reproduce the issue locally without forcing a sleep on the main thread to mimic the failure condition but with this patch to the test valgrind succeeds on

$ BUILD_TYPE=Debug WITH_LUA_ENGINE=Lua make
$ valgrind --suppressions=.ci/valgrind_mem.supp --error-exitcode=1 --leak-check=full --child-silent-after-fork=yes --keep-debuginfo=yes --track-origins=yes ./build/lua -e "collectgarbage('incremental', 0, 10000000000000)" tests/test-thread.lua
1..11
  In    uv_thread_t: 0x58246c0
  { delay = 100, elapsed = 196 }
ok 1 test thread create
ok 2 test thread create with arguments
  Running       uv_thread_t: 0x4bad740
  Running       uv_thread_t: 0x4bad740
  { delay = 100, elapsed = 101 }
ok 3 test thread sleep msecs in main thread
  { delay = 100, elapsed = 120 }
ok 4 test thread create with options table
ok 5 test thread arguments limit
hello world from thread
ok 6 test avoid thread be released before it done
ok 7 thread_getcpu
ok 8 getaffinity, setaffinity
default priority        0
priority after change   0
priority in thread      0
ok 9 getpriority, setpriority
ok 10 getname, setname
hello world from detached thread
ok 11 detach
# All tests passed
==190== 
==190== HEAP SUMMARY:
==190==     in use at exit: 0 bytes in 0 blocks
==190==   total heap usage: 11,698 allocs, 11,698 frees, 1,310,220 bytes allocated
==190== 
==190== All heap blocks were freed -- no leaks are possible
==190== 
==190== For lists of detected and suppressed errors, rerun with: -s

The next concern is that it appears setpriority is doing nothing, because THREAD_PRIORITY_LOWEST is not 0 on my system.

@squeek502
Copy link
Member Author
squeek502 commented May 13, 2025

getpriority returning 0 is actually the expected behavior I believe. Here's the relevant part of the libuv test:

https://github.com/libuv/libuv/blob/8d4218dcd7ac30a510caec3da50374b517ede425/test/test-thread-priority.c#L68-L85

When running that test locally:

  • min gets set to 0 (returned from sched_get_priority_min)
  • UV_THREAD_PRIORITY_LOWEST is -2
  • priority after the uv_thread_getpriority call is 0
  • ASSERT_EQ(priority, min); passes since both are 0 8000

@squeek502
Copy link
Member Author

@truemedian would you like to PR theuv_sem_t binding?

@truemedian
Copy link
Member
truemedian commented May 17, 2025

Submitted a PR for it but it exposed some rather concerning underlying problems about how threads work in luv. In particular around how the internals of handles are designed around an extra layer of pointer indirection to allow them to be passed across threads despite the fact that libuv states that only uv_async_t has any thread safety guarantees and therefore every other handle type cannot be safely passed to any other thread.

I'm in the process of going through and figuring out what exactly is broken and in the process might end up with just a large patch to improve/update/fix everything I find.

@squeek502
Copy link
Member Author
squeek502 commented May 17, 2025

Do you think it's worth merging the uv_sem_t bindings now, or would it be better to wait until the larger fix is ready?

If it would be better to wait, we could just disable the flaky getpriority, setpriority test temporarily to fix the intermittent CI failures.

@truemedian
Copy link
Member

The semaphore bindings are useful if ever so slightly, and it's mostly self contained.

I have no idea how long what I'm working on with take, but I foresee at least a week (probably more) and probably way more modifications than anyone is comfortable seeing.

@squeek502
Copy link
Member Author

No rush, I'm satisfied that we at least know what's happening with the intermittent valgrind failures.

Do you think the uv_sem_t bindings will be unchanged after the full fix? That is, would it be okay to release a version that includes the uv_sem_t bindings in the meantime?

@truemedian
Copy link
Member
truemedian commented May 17, 2025

The interface shouldn't need to change, there really isn't much to bind there. I'm fairly confident that I can assert the interface will never need to change (unless I massively overlooked something)

Additionally, I'm pretty sure everything I want to change is under the hood and should have no interface-breaking concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants
0