-
Notifications
You must be signed in to change notification settings - Fork 577
Fix UMAP outlier issue #6662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-25.08
Are you sure you want to change the base?
Fix UMAP outlier issue #6662
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to add a test for this?
aa9d31f
to
a531617
Compare
Unfortunately, not really. The outlier samples with a large number of connections only occur in very large datasets, which aren't suitable for inclusion in the CI. While it might be possible to generate this type of unbalanced graph artificially, I don't currently have a method for doing so. |
@viclafargue can you update the labels for this PR? @divyegala Would you be ok with merging this without the tests? |
return create_rounding_factor(max_abs, n_edges); | ||
|
||
// Sort the buffer | ||
thrust::sort(rmm::exec_policy(stream), buffer.data(), buffer.data() + n_samples); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viclafargue have you done any benchmarking on this to evaluate the impact? I understand this is fixing a bug that surfaces only sometimes- but if there's a major perf cost here then we'd be paying it even if in the case it's not present, right?
We should try and better characterize this so that we can 1) test it, and 2) determine when it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran a benchmark to assess the impact of the change, and it doesn't appear to have any significant effect.
import time
from cuml.datasets import make_blobs
from cuml.manifold import UMAP
X, _ = make_blobs(
n_samples=1_000_000,
n_features=64,
centers=10,
cluster_std=1.0,
random_state=42,
)
umap = UMAP(
n_neighbors=15,
n_components=2,
min_dist=0.1,
random_state=42,
)
start = time.time()
embedding = umap.fit_transform(X)
duration = time.time() - start
print(f"UMAP fit_transform took {duration:.2f} seconds")
Before change :
UMAP fit_transform took 48.32 seconds
UMAP fit_transform took 48.67 seconds
UMAP fit_transform took 48.52 seconds
After change :
UMAP fit_transform took 48.80 seconds
UMAP fit_transform took 48.26 seconds
UMAP fit_transform took 48.49 seconds
In light of @cjnolet 's concern regarding a potential performance regression, I recommend pushing this to 25.08. |
Answers #6454
This PR addresses the issue of outliers in embeddings generated by UMAP.
It introduces two key improvements to mitigate this problem:
max_abs
value : The rounding factor is determined based on the worst-case maximum possible gradient update. Thismax_abs
value depends on the maximum number of connections (edges) a sample can have in the graph. The PR lowers this number by using a more conservative estimate of connectivity, specifically it sets the value based on the 95th percentile of the number of connections. As a result, the worst-case gradient update is smaller, which allows for a smaller rounding factor and therefore improved precision. Notably, this change does not appear to affect the results of reproducibility tests.