Fix UMAP outlier issue #6662

viclafargue · 2025-05-09T16:51:06Z

Answers #6454

This PR addresses the issue of outliers in embeddings generated by UMAP.
It introduces two key improvements to mitigate this problem:

- Dynamic rounding adjustment: the rounding factor is updated at each epoch to improve accuracy and prevent large, unstable gradient updates.
- Lower max_abs value: the rounding factor is derived from the worst-case maximum possible gradient update, max_abs, which depends on the maximum number of connections (edges) a sample can have in the graph. The PR lowers this number by using a more conservative estimate of connectivity: it sets the value from the 95th percentile of the per-sample number of connections. The worst-case gradient update is therefore smaller, which allows a smaller rounding factor and improved precision. This change does not appear to affect the results of the reproducibility tests.
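To make the effect of the second point concrete, here is a minimal Python sketch. It is not the cuML implementation. It assumes the rounding factor is a power of two sized to the worst-case sum of gradient terms, and that truncation works by adding and then subtracting that factor in float32. The helper names (`f32`, `degree_percentile`, `rounding_factor`, `truncate`), the percentile index formula, and the clip value of 4.0 are illustrative assumptions.

```python
import math
import struct
from collections import Counter


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def degree_percentile(rows, n_samples, q=95.0):
    """q-th percentile of per-sample edge counts, via sort and index lookup.

    The index formula is an assumption, not taken from the PR.
    """
    counts = Counter(rows)
    degrees = sorted(counts.get(i, 0) for i in range(n_samples))
    idx = min(int(n_samples * q / 100.0), n_samples - 1)
    return degrees[idx]


def rounding_factor(max_abs, n_terms):
    """Hypothetical factor: smallest power of two above max_abs * n_terms."""
    _, exp = math.frexp(max_abs * n_terms)
    return math.ldexp(1.0, exp)


def truncate(factor, x):
    """Snap x onto the float32 grid defined by factor."""
    return f32(f32(factor + x) - factor)


# One hub with 1000 edges; 99 ordinary samples with 15 edges each.
rows = [0] * 1000 + [i for i in range(1, 100) for _ in range(15)]
n_max = max(Counter(rows).values())
n_p95 = degree_percentile(rows, n_samples=100)

clip = 4.0  # assumed bound on a single gradient term
x = 0.1234567
for label, n in (("max", n_max), ("p95", n_p95)):
    factor = rounding_factor(clip, n)
    err = abs(truncate(factor, x) - x)
    print(f"{label}: n_edges={n}, factor={factor}, truncation error={err:.2e}")
```

In this toy graph the single hub sets the maximum, so switching to the 95th percentile shrinks the factor and tightens the float32 grid that gradient terms are snapped to.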

copy-pr-bot · 2025-05-09T16:51:09Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

divyegala

Is there a way to add a test for this?

cpp/src/umap/simpl_set_embed/algo.cuh

viclafargue · 2025-05-28T11:34:47Z

> Is there a way to add a test for this?

Unfortunately, not really. The outlier samples with a large number of connections only occur in very large datasets, which aren't suitable for inclusion in the CI. While it might be possible to generate this type of unbalanced graph artificially, I don't currently have a method for doing so.
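One possible way to build such an unbalanced graph artificially, offered as an untested idea rather than a known reproducer: place points on a high-dimensional unit sphere and add one point at the center. The center is then closer to each surface point than nearly all other surface points are, so it ends up in almost every k-nearest-neighbor list. The sketch below uses brute-force kNN in plain Python; `make_hub_dataset` and `knn_in_degrees` are hypothetical helpers, and whether this triggers the outlier behavior in cuML's UMAP has not been verified.

```python
import math
import random


def make_hub_dataset(n_points, n_dims, seed=0):
    """n_points random unit vectors plus one hub at the origin (index 0)."""
    rng = random.Random(seed)
    data = [[0.0] * n_dims]
    for _ in range(n_points):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(c * c for c in v))
        data.append([c / norm for c in v])
    return data


def knn_in_degrees(data, k):
    """Brute-force kNN; returns how often each point appears as a neighbor."""
    n = len(data)
    in_deg = [0] * n
    for i in range(n):
        dists = [
            (sum((a - b) ** 2 for a, b in zip(data[i], data[j])), j)
            for j in range(n)
            if j != i
        ]
        for _, j in sorted(dists)[:k]:
            in_deg[j] += 1
    return in_deg


data = make_hub_dataset(n_points=100, n_dims=64)
in_deg = knn_in_degrees(data, k=5)
print("hub in-degree:", in_deg[0])
print("largest in-degree among the other points:", max(in_deg[1:]))
```

A dataset like this is small enough for CI, but it only guarantees a heavily connected sample in the kNN graph, not that the embedding outlier itself appears.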

csadorf · 2025-05-28T14:20:45Z

@viclafargue can you update the labels for this PR?

@divyegala Would you be ok with merging this without the tests?

cjnolet · 2025-05-28T14:48:16Z

cpp/src/umap/simpl_set_embed/algo.cuh

-  return create_rounding_factor(max_abs, n_edges);
+
+  // Sort the buffer
+  thrust::sort(rmm::exec_policy(stream), buffer.data(), buffer.data() + n_samples);


@viclafargue have you done any benchmarking on this to evaluate the impact? I understand this fixes a bug that only surfaces occasionally, but if there's a major perf cost here, we'd be paying it even in cases where the bug is not present, right?

We should try and better characterize this so that we can 1) test it, and 2) determine when it's needed.

I ran a benchmark to assess the impact of the change, and it doesn't appear to have any significant effect.

    import time

    from cuml.datasets import make_blobs
    from cuml.manifold import UMAP

    X, _ = make_blobs(
        n_samples=1_000_000,
        n_features=64,
        centers=10,
        cluster_std=1.0,
        random_state=42,
    )

    umap = UMAP(
        n_neighbors=15,
        n_components=2,
        min_dist=0.1,
        random_state=42,
    )

    start = time.time()
    embedding = umap.fit_transform(X)
    duration = time.time() - start
    print(f"UMAP fit_transform took {duration:.2f} seconds")

Before change:
UMAP fit_transform took 48.32 seconds
UMAP fit_transform took 48.67 seconds
UMAP fit_transform took 48.52 seconds

After change:
UMAP fit_transform took 48.80 seconds
UMAP fit_transform took 48.26 seconds
UMAP fit_transform took 48.49 seconds
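As a quick check on the numbers above, the following snippet averages the three runs on each side and reports the difference. It uses only the timings quoted in this comment; with three runs per configuration it is a rough comparison, not a statistical test.

```python
from statistics import mean

before = [48.32, 48.67, 48.52]  # seconds, before the change
after = [48.80, 48.26, 48.49]   # seconds, after the change

mean_before = mean(before)
mean_after = mean(after)
delta = mean_after - mean_before
rel = 100.0 * delta / mean_before

print(f"mean before: {mean_before:.2f} s")
print(f"mean after:  {mean_after:.2f} s")
print(f"difference:  {delta:+.3f} s ({rel:+.2f}%)")
print(f"run-to-run spread before: {max(before) - min(before):.2f} s")
```

The difference between the means is much smaller than the spread between individual runs, which is consistent with the conclusion that the change has no significant cost on this workload.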

csadorf · 2025-05-28T14:50:00Z

In light of @cjnolet's concern regarding a potential performance regression, I recommend pushing this to 25.08.

Fix UMAP outlier issue

8b784f0

github-actions bot added the CUDA/C++ label May 9, 2025

divyegala assigned viclafargue May 13, 2025

csadorf linked an issue May 14, 2025 that may be closed by this pull request

cuml.UMAP embeddings result in outliers #6454

Open

viclafargue added 2 commits May 23, 2025 10:02

Merge branch 'branch-25.06' into fix-umap-outlier-issue

61c6518

n_edges to 95 percentile

ba0822f

viclafargue marked this pull request as ready for review May 23, 2025 09:33

viclafargue requested a review from a team as a code owner May 23, 2025 09:33

viclafargue requested review from dantegd and divyegala May 23, 2025 09:33

Merge branch 'branch-25.06' into fix-umap-outlier-issue

ec91e08

divyegala reviewed May 27, 2025

cpp/src/umap/simpl_set_embed/algo.cuh (outdated)

divyegala reviewed May 28, 2025

cpp/src/umap/simpl_set_embed/algo.cuh (outdated)

adressing review

a531617

viclafargue force-pushed the fix-umap-outlier-issue branch from aa9d31f to a531617 Compare May 28, 2025 10:33

viclafargue added bug Something isn't working non-breaking Non-breaking change labels May 28, 2025

cjnolet reviewed May 28, 2025

Merge branch 'branch-25.08' into fix-umap-outlier-issue

bb5502e

viclafargue changed the base branch from branch-25.06 to branch-25.08 June 5, 2025 17:14

divyegala approved these changes Jun 16, 2025

Merge branch 'branch-25.08' into fix-umap-outlier-issue

2b0ca18
