8000 Fix UMAP outlier issue by viclafargue · Pull Request #6662 · rapidsai/cuml · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix UMAP outlier issue #6662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: branch-25.08
Choose a base branch
from

Conversation

viclafargue
Copy link
Contributor
@viclafargue viclafargue commented May 9, 2025

Answers #6454

This PR addresses the issue of outliers in embeddings generated by UMAP.
It introduces two key improvements to mitigate this problem:

  1. Dynamic rounding adjustment: The rounding factor is updated at each epoch to enhance accuracy and prevent large, unstable gradient updates.
  2. Lowering of the max_abs value : The rounding factor is determined based on the worst-case maximum possible gradient update. This max_abs value depends on the maximum number of connections (edges) a sample can have in the graph. The PR lowers this number by using a more conservative estimate of connectivity, specifically it sets the value based on the 95th percentile of the number of connections. As a result, the worst-case gradient update is smaller, which allows for a smaller rounding factor and therefore improved precision. Notably, this change does not appear to affect the results of reproducibility tests.

Copy link
copy-pr-bot bot commented May 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@csadorf csadorf linked an issue May 14, 2025 that may be closed by this pull request
@viclafargue viclafargue marked this pull request as ready for review May 23, 2025 09:33
@viclafargue viclafargue requested a review from a team as a code owner May 23, 2025 09:33
@viclafargue viclafargue requested review from dantegd and divyegala May 23, 2025 09:33
Copy link
Member
@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to add a test for this?

@viclafargue viclafargue force-pushed the fix-umap-outlier-issue branch from aa9d31f to a531617 Compare May 28, 2025 10:33
@viclafargue
Copy link
Contributor Author

Is there a way to add a test for this?

Unfortunately, not really. The outlier samples with a large number of connections only occur in very large datasets, which aren't suitable for inclusion in the CI. While it might be possible to generate this type of unbalanced graph artificially, I don't currently have a method for doing so.

@csadorf
Copy link
Contributor
csadorf commented May 28, 2025

@viclafargue can you update the labels for this PR?

@divyegala Would you be ok with merging this without the tests?

@viclafargue viclafargue added bug Something isn't working non-breaking Non-breaking change labels May 28, 2025
return create_rounding_factor(max_abs, n_edges);

// Sort the buffer
thrust::sort(rmm::exec_policy(stream), buffer.data(), buffer.data() + n_samples);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viclafargue have you done any benchmarking on this to evaluate the impact? I understand this is fixing a bug that surfaces only sometimes- but if there's a major perf cost here then we'd be paying it even if in the case it's not present, right?

We should try and better characterize this so that we can 1) test it, and 2) determine when it's needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a benchmark to assess the impact of the change, and it doesn't appear to have any significant effect.

import time
from cuml.datasets import make_blobs
from cuml.manifold import UMAP

X, _ = make_blobs(
    n_samples=1_000_000,
    n_features=64,
    centers=10,
    cluster_std=1.0,
    random_state=42,
)

umap = UMAP(
    n_neighbors=15,
    n_components=2,
    min_dist=0.1,
    random_state=42,
)

start = time.time()
embedding = umap.fit_transform(X)
duration = time.time() - start

print(f"UMAP fit_transform took {duration:.2f} seconds")

Before change :
UMAP fit_transform took 48.32 seconds
UMAP fit_transform took 48.67 seconds
UMAP fit_transform took 48.52 seconds

After change :
UMAP fit_transform took 48.80 seconds
UMAP fit_transform took 48.26 seconds
UMAP fit_transform took 48.49 seconds

@csadorf
Copy link
Contributor
csadorf commented May 28, 2025

In light of @cjnolet 's concern regarding a potential performance regression, I recommend pushing this to 25.08.

@viclafargue viclafargue changed the base branch from branch-25.06 to branch-25.08 June 5, 2025 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CUDA/C++ non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cuml.UMAP embeddings result in outliers
4 participants
0