-
Notifications
You must be signed in to change notification settings - Fork 451
HPSF CI test with more build types #8011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
HPSF CI test with more build types #8011
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
containers/src/Kokkos_Vector.hpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't rely on the status after calling resize
. It might be modified on the host or device. With the current implementation we are ending up setting both modified flags which we catch in Debug
mode.
Since we are modifying the host-side after resize, we should only sync then. Also, we can be more specific about the modifed flags afterwards by calling modify_host
.
if (std::is_same_v<TEST_EXECSPACE, Kokkos::SYCL>) upper_limit_0 = 19; | ||
#endif | ||
for (int i = 1; i < upper_limit_0; i++) { | ||
// FIXME_SYCL Failing for Intel GPUs highly dependent on optimization flags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be fine for oneAPI 2025.0.4 and later
CI timings for a rough estimate:
|
Retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like the comment to be more clear but this is non blocking.
The rest is just comments/questions outside of the review scope
@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) { | |||
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel " | |||
"gpu and with rdc support"; | |||
#endif | |||
#if defined(KOKKOS_ENABLE_DEBUG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I would expect this to fail in DEBUG mode...I would expect the cycle counting to only work reliably for highly optimized kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do in debug mode that causes failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do the same, but counting clock cycles to introspect when pages are migrated does not sound robust in a debug mode. We have no idea if it does do the same granularity etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on a discussion with Jakob, now wondering if that guard should be
#if defined(KOKKOS_ENABLE_DEBUG) | |
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, without bounds checking this should actually work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMAKE_BUILD_TYPE=Debug
implies that Kokkos_ENABLE_DEBUG
and Kokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
are activated by default but Kokkos_ENABLE_DEBUG_BOUNFS_CHECK
's default value is always OFF
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for correcting me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I don't know why this is failing. Any guesses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record I am updating the doc to reflect reality
kokkos/kokkos-core-wiki#669
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think debug mode with turned off optimization might just make everything a bit more unpredictable. For example functions not inlined that lead to jumps in memory that can cause varying timings etc.. So while I wouldn't have bet on this, it also isn't necessarily super surprising.
7c5bc63
to
6f95df2
Compare
"is defined"; | ||
GTEST_SKIP() | ||
<< "skipping because SYCL device-side abort does not show an error " | ||
"message independent of NDEBUG"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Aurora
, out of the 4 checks
Abort
SpaceAwareAccessorAccessViolation
ViewMemoryAccessViolation
ViewOutOfBoundsAccess
only SpaceAwareAccessorAccessViolation
and ViewMemoryAccessViolation
are failing with something like
[ RUN ] sycl_DeathTest.mdspan_space_aware_accessor_invalid_access_from_device
FATAL: Unexpected page fault from GPU at 0x0, ctx_id: 9 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 0 (Read), banned: 1, aborting.
Abort was called at 287 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-24.35.30872.31/shared/source/os_interface/linux/drm_neo.cpp
/home/darndt/kokkos_new/core/unit_test/TestSpaceAwareAccessorAccessViolation.hpp:44: Failure
Death test: { Kokkos::parallel_for(Kokkos::RangePolicy<ExecutionSpace>(s, 0, 1), *this); Kokkos::fence(); }
Result: died but not with expected error.
Expected: contains regular expression "Kokkos::SpaceAwareAccessor ERROR: attempt to access inaccessible memory space"
Actual msg:
[ DEATH ] FATAL: Unexpected page fault from GPU at 0x0, ctx_id: 9 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 0 (Read), banned: 1, aborting.
[ DEATH ]
[ FAILED ] sycl_DeathTest.mdspan_space_aware_accessor_invalid_access_from_device (585 ms)
so the invalid access has a higher priority than the error message from Kokkos::abort
and the kernel keeps executing after calls to Kokkos::abort
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are now just checking that we abort no matter what the error message is.
@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) { | |||
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel " | |||
"gpu and with rdc support"; | |||
#endif | |||
#if defined(KOKKOS_ENABLE_DEBUG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do in debug mode that causes failure?
containers/src/Kokkos_Vector.hpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want that patch merged in as part of this PR?
Isn't that it bug fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to create a pull request just for this change but the test only fails in Debug mode right now, i.e., we coincidentally do the right thing regardless and I'm not to keen on writing more tests for a deprecated feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question, and also I want the vector thing in the separate PR merged first.
@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) { | |||
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel " | |||
"gpu and with rdc support"; | |||
#endif | |||
#if defined(KOKKOS_ENABLE_DEBUG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think debug mode with turned off optimization might just make everything a bit more unpredictable. For example functions not inlined that lead to jumps in memory that can cause varying timings etc.. So while I wouldn't have bet on this, it also isn't necessarily super surprising.
@@ -131,7 +131,7 @@ struct StaticRank<0> { | |||
TEST(TEST_CATEGORY_DEATH, view_construction_with_wrong_params_stat) { | |||
::testing::FLAGS_gtest_death_test_style = "threadsafe"; | |||
|
|||
#ifndef KOKKOS_ENABLE_DEBUG_BOUNDS_CHECKS | |||
#ifndef KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so did the test pass? Or did we just not test with bounds check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CUDA-11.6-NVCC-DEBUG
build has Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON
.
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: JBludau <104908666+JBludau@users.noreply.github.com> Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
d5d9d62
to
e5a7092
Compare
The requested changes and questions were addressed.
This pull request adds more build types for the existing HPSF CI builds
RelWithDebInfo
Release
Debug
Building in
Debug
mode required a bunch of changes:AKokkos::Vector
unit test is broken and we need the same fix as in DualView: Store modified flags as bool #7993.SYCL
more bessel function unit tests are broken.view_memory_access_violations_from_device
doesn't work independent of the build type for SYCL. The error message captured is just empty.Drive-by-change:
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECKS
->KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK
https://gitlab.spack.io/masterleinad/kokkos/-/pipelines/1058739 shows a run in
Debug
mode with these changes. I verified manually that theatomic
performance regression is the only test failing forHIP
.