8000 HPSF CI test with more build types by masterleinad · Pull Request #8011 · kokkos/kokkos · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

HPSF CI test with more build types #8011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

masterleinad
Copy link
Contributor
@masterleinad masterleinad commented Apr 21, 2025

This pull request adds more build types for the existing HPSF CI builds

  • for pull requests, we continue to use RelWithDebInfo
  • for merge commits, we use Release
  • for nightly builds, we use Debug

Building in Debug mode required a bunch of changes:

  • The shared_space test is failing frequently (which is spurious and we care about behavior in non-Debug builds)
  • A Kokkos::Vector unit test is broken and we need the same fix as in DualView: Store modified flags as bool #7993.
  • For SYCL more bessel function unit tests are broken.
  • view_memory_access_violations_from_device doesn't work independent of the build type for SYCL. The error message captured is just empty.

Drive-by-change: KOKKOS_ENABLE_DEBUG_BOUNDS_CHECKS -> KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK

https://gitlab.spack.io/masterleinad/kokkos/-/pipelines/1058739 shows a run in Debug mode with these changes. I verified manually that the atomic performance regression is the only test failing for HIP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means these have not been running for a while.
The bug was introduced in #5190 and had gone unnoticed in #7310.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't rely on the status after calling resize. It might be modified on the host or device. With the current implementation we are ending up setting both modified flags which we catch in Debug mode.
Since we are modifying the host-side after resize, we should only sync then. Also, we can be more specific about the modifed flags afterwards by calling modify_host.

if (std::is_same_v<TEST_EXECSPACE, Kokkos::SYCL>) upper_limit_0 = 19;
#endif
for (int i = 1; i < upper_limit_0; i++) {
// FIXME_SYCL Failing for Intel GPUs highly dependent on optimization flags
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be fine for oneAPI 2025.0.4 and later

@masterleinad
Copy link
Contributor Author
masterleinad commented Apr 22, 2025

CI timings for a rough estimate:

Build Debug RelWithDebInfo Release
AMD-MI300A 21m 41s 23m 04s 20m 34s
INTEL-DATA-CENTER-MAX 49m 52s 35m 59s 28m 22s
NVIDIA-GH200 35m 05s 10m 33s 17m 20s
NVIDIA-GRACE-GRACE 22m 46s 5m 12s 5m 40s
NVIDIA-P100 52m 28s 12m 11s 12m 4s
NVIDIA-RTX5080 25m 05s 10m 13s 10m 15s

Debug expectedly is most expensive and there isn't much of difference between Release and RelWithDebInfo. Thus, I would use Debug for Nightly builds but don't have a strong opinion about the other two.

@masterleinad
Copy link
Contributor Author

Retest this please.

@masterleinad masterleinad requested a review from JBludau April 23, 2025 12:07
Copy link
Contributor
@JBludau JBludau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like the comment to be more clear but this is non blocking.
The rest is just comments/questions outside of the review scope

@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) {
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel "
"gpu and with rdc support";
#endif
#if defined(KOKKOS_ENABLE_DEBUG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would expect this to fail in DEBUG mode...I would expect the cycle counting to only work reliably for highly optimized kernels

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we do in debug mode that causes failure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do the same, but counting clock cycles to introspect when pages are migrated does not sound robust in a debug mode. We have no idea if it does do the same granularity etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on a discussion with Jakob, now wondering if that guard should be

Suggested change
#if defined(KOKKOS_ENABLE_DEBUG)
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, without bounds checking this should actually work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMAKE_BUILD_TYPE=Debug implies that Kokkos_ENABLE_DEBUG and Kokkos_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK are activated by default but Kokkos_ENABLE_DEBUG_BOUNFS_CHECK's default value is always OFF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for correcting me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I don't know why this is failing. Any guesses?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record I am updating the doc to reflect reality
kokkos/kokkos-core-wiki#669

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug mode with turned off optimization might just make everything a bit more unpredictable. For example functions not inlined that lead to jumps in memory that can cause varying timings etc.. So while I wouldn't have bet on this, it also isn't necessarily super surprising.

"is defined";
GTEST_SKIP()
<< "skipping because SYCL device-side abort does not show an error "
"message independent of NDEBUG";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Aurora, out of the 4 checks

      Abort
      SpaceAwareAccessorAccessViolation
      ViewMemoryAccessViolation
      ViewOutOfBoundsAccess

only SpaceAwareAccessorAccessViolation and ViewMemoryAccessViolation are failing with something like

[ RUN      ] sycl_DeathTest.mdspan_space_aware_accessor_invalid_access_from_device
FATAL: Unexpected page fault from GPU at 0x0, ctx_id: 9 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 0 (Read), banned: 1, aborting.
Abort was called at 287 line in file:
/home/ubit/rpmbuild/BUILD/intel-compute-runtime-24.35.30872.31/shared/source/os_interface/linux/drm_neo.cpp
/home/darndt/kokkos_new/core/unit_test/TestSpaceAwareAccessorAccessViolation.hpp:44: Failure
Death test: { Kokkos::parallel_for(Kokkos::RangePolicy<ExecutionSpace>(s, 0, 1), *this); Kokkos::fence(); }
    Result: died but not with expected error.
  Expected: contains regular expression "Kokkos::SpaceAwareAccessor ERROR: attempt to access inaccessible memory space"
Actual msg:
[  DEATH   ] FATAL: Unexpected page fault from GPU at 0x0, ctx_id: 9 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 0 (Read), banned: 1, aborting.
[  DEATH   ] 
[  FAILED  ] sycl_DeathTest.mdspan_space_aware_accessor_invalid_access_from_device (585 ms)

so the invalid access has a higher priority than the error message from Kokkos::abort and the kernel keeps executing after calls to Kokkos::abort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are now just checking that we abort no matter what the error message is.

@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) {
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel "
"gpu and with rdc support";
#endif
#if defined(KOKKOS_ENABLE_DEBUG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we do in debug mode that causes failure?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want that patch merged in as part of this PR?
Isn't that it bug fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to create a pull request just for this change but the test only fails in Debug mode right now, i.e., we coincidentally do the right thing regardless and I'm not to keen on writing more tests for a deprecated feature.

Copy link
Member
@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, and also I want the vector thing in the separate PR merged first.

@@ -115,6 +115,10 @@ TEST(defaultdevicetype, shared_space) {
GTEST_SKIP() << "skipping because clock_tic is only defined for sycl+intel "
"gpu and with rdc support";
#endif
#if defined(KOKKOS_ENABLE_DEBUG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug mode with turned off optimization might just make everything a bit more unpredictable. For example functions not inlined that lead to jumps in memory that can cause varying timings etc.. So while I wouldn't have bet on this, it also isn't necessarily super surprising.

@@ -131,7 +131,7 @@ struct StaticRank<0> {
TEST(TEST_CATEGORY_DEATH, view_construction_with_wrong_params_stat) {
::testing::FLAGS_gtest_death_test_style = "threadsafe";

#ifndef KOKKOS_ENABLE_DEBUG_BOUNDS_CHECKS
#ifndef KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so did the test pass? Or did we just not test with bounds check?

Copy link
Contributor Author
@masterleinad masterleinad Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CUDA-11.6-NVCC-DEBUG build has Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON.

@dalg24 dalg24 mentioned this pull request Apr 28, 2025
masterleinad and others added 9 commits April 28, 2025 13:23
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: JBludau <104908666+JBludau@users.noreply.github.com>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
Signed-off-by: Daniel Arndt <arndtd@ornl.gov>
@masterleinad masterleinad force-pushed the hpsf_ci_more_build_types branch from d5d9d62 to e5a7092 Compare April 28, 2025 17:24
@masterleinad masterleinad requested a review from crtrott April 28, 2025 17:24
@masterleinad masterleinad dismissed crtrott’s stale review April 30, 2025 19:39

The requested changes and questions were addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0