gpuCI broken #11312
Thanks for raising an issue @fjetter - I'll definitely work on getting gpuCI back in a passing state today. I'm not sure what is causing the build failure, but I do know some recent dask/array work has definitely broken cupy support.
I'm really sorry there has been so much unwanted gpuCI noise lately. It looks like gpuCI is now "fixed" in the sense that the pytests should all pass. However, we have not figured out how to fix the remaining intermittent failure yet. If you do happen to see this failure in the wild, members of the dask org can re-run the gpuCI check (and only that check) by commenting on the PR. cc @fjetter @phofl @jrbourbeau @hendrikmakait (just to make sure you know about this option).
@dask/gpu gpuCI appears to be broken again. One example is #11354, but there are other failures, and it looks quite intermittent. Looking at Jenkins, this almost feels like a gpuCI-internal problem.
Right, the failures are intermittent, and the check can always be re-run with a comment. Our ops team is working on a replacement for our current Jenkins infrastructure at the moment. I'm sorry again for the noise.
How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled it until it is reliable again?
We are discussing this internally to figure out the best way to proceed, but I do have a strong preference to keep gpuCI turned on for now if you/others are willing. Our team obviously finds gpuCI valuable, but I do understand why you would see things a different way.

When gpuCI was actually broken a few weeks ago (not just flaky the way it is now), changes were merged into main that broke cupy support. In theory, gpuCI is a convenient way for contributors/maintainers to know right away if a new change is likely to break GPU compatibility. The alternative is of course that we (RAPIDS) run our own nightly tests against main and raise an issue when something breaks. In some cases, the fix will be simple. In others, the change could be a nightmare to roll back or fix.

What would be an ideal developer experience on your end? I'm hoping we can work toward something that makes everyone "happy enough".
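For context, here is a minimal sketch (not from this thread) of the kind of cupy regression gpuCI is meant to catch: a dask.array graph backed by CuPy chunks breaks if an intermediate step implicitly coerces chunks to NumPy. The function names are hypothetical, and running it assumes cupy and a GPU are available.

```python
# Hypothetical sketch: how a dask/array change can break CuPy support.
import numpy as np
import dask.array as da
import cupy as cp  # assumption: cupy and a GPU are available

# A CuPy-backed dask array; dask infers the chunk type from the input.
x = da.from_array(cp.arange(16).reshape(4, 4), chunks=2)

def scale_keeps_gpu(arr):
    # Operating on chunks directly lets NumPy/CuPy dispatch keep data on the GPU.
    return arr.map_blocks(lambda chunk: chunk * 2)

def scale_breaks_gpu(arr):
    # np.asarray on a CuPy chunk fails, because CuPy refuses implicit
    # device-to-host conversion; a change like this is what gpuCI catches.
    return arr.map_blocks(lambda chunk: np.asarray(chunk) * 2)

assert isinstance(scale_keeps_gpu(x).compute(), cp.ndarray)  # stays on the GPU
# scale_breaks_gpu(x).compute() raises TypeError with CuPy-backed chunks.
```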
Roughly a year ago we proposed moving Dask to a GitHub Actions-based system for GPU CI in dask/community#348. We didn't hear much from other maintainers there (admittedly, there could have been offline discussion I'm unaware of). Perhaps it is worth reading that issue and sharing your thoughts on that approach? 🙂
This has been raised on #11242 already, but I always have difficulty finding that draft PR, and from what I can tell the failures are not related to a version update.
gpuCI has been pretty consistently failing for a while now.
Logs show something like (from #11310 // https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/console)
cc @dask/gpu