Add KMeansPolis implementation that more closely maps to Polis algorithm by patcon · Pull Request #8 · polis-community/red-dwarf

Add KMeansPolis implementation that more closely maps to Polis algorithm #8

Draft
wants to merge 1 commit into
base: main

patcon (Member) commented Feb 25, 2025

This is low priority and will stay a draft for discussion, so that data science folks who know more than me can chime in.

KMeans iterates up to a maximum number of iterations, but stops early once a convergence condition between successive iterations is met.

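  • In sklearn's KMeans implementation, the tol parameter determines when to stop. Older sklearn documentation described tol as relative to inertia; current documentation describes it as relative to the Frobenius norm of the difference in cluster centers between consecutive iterations.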
  • In Polis' KMeans implementation, the cluster center movement falling within a tolerance determines when to stop.
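
To make the contrast concrete, here is a minimal pure-Python sketch of Lloyd's algorithm with a switchable stopping rule. It reproduces neither sklearn's nor Polis' actual code: the function name `lloyd_kmeans`, the `stop` flag, and the use of an absolute `tol` are all assumptions made for illustration.

```python
import math


def lloyd_kmeans(points, init_centers, stop="center_shift", tol=1e-4, max_iter=100):
    """Plain Lloyd's k-means with a selectable stopping rule (illustrative only).

    stop="center_shift": stop once no single center moved farther than tol.
    stop="inertia":      stop once inertia changed by at most tol since the
                         previous iteration.
    """
    centers = [tuple(float(v) for v in c) for c in init_centers]
    prev_inertia = None
    labels = []
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest current center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Inertia measured against the centers used for this assignment.
        inertia = sum(math.dist(p, centers[k]) ** 2 for p, k in zip(points, labels))

        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)

        # Largest distance travelled by any single center in this iteration.
        max_shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers

        if stop == "center_shift" and max_shift <= tol:
            break
        if stop == "inertia" and prev_inertia is not None and abs(prev_inertia - inertia) <= tol:
            break
        prev_inertia = inertia
    return centers, labels, n_iter
```

On the four points (0, 0), (0, 1), (10, 0), (10, 1) with initial centers (0, 0) and (10, 0), both rules reach the same centers, but the center-movement rule stops after 2 iterations while the inertia rule needs 3, because it has to observe an unchanged inertia before stopping.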

This might lead to slightly different results, which might matter later. Leaving this here for posterity.

Might investigate this further when we have unit tests over run_kmeans() that test various sizes of conversations.


This PR was written with the help of ChatGPT:
See: https://chatgpt.com/c/67be1cc5-b00c-800b-95ba-a0267edfb836

  • Clojure's [Polis] threshold is per-cluster-center movement, while sklearn's tol is based on total inertia change.
  • If clusters shift slightly but inertia barely changes, sklearn might stop earlier than [Polis'] same-clustering?.
  • If inertia fluctuates while centers stay put, sklearn might run longer than [Polis'] same-clustering?.

If cluster centers move but inertia remains essentially unchanged, it means the center movement, and any resulting reassignment of points, does not significantly affect the total squared distances.

Possible Scenarios Where This Happens:

  1. Centers Shift Without Changing Assignments
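    • If all data points remain assigned to the same clusters despite center movement, inertia changes only slightly: moving a center a distance d onto the mean of its n unchanged points lowers that cluster's inertia by n·d², which is tiny when d is small.
    • Example: A center shifts by 0.01 within a 4-point cluster, so the sum of squared distances changes by only 4 × 0.01² = 0.0004.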
  2. Symmetric Reassignment of Points
    • Suppose some points switch clusters, but the overall distribution remains similar.
    • Example: Two clusters swap a few points, but the summed squared distance to the centers remains about the same.
  3. Flat Regions in the Data
    • If the dataset has a uniform spread of points, minor shifts in cluster centers might not impact the overall distance sum.
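
Scenario 1 can be checked numerically. For a fixed set of n points, the inertia around any center c equals the inertia around the mean plus n·|c − mean|², so a small center shift produces an even smaller inertia change. The sketch below uses made-up points and hypothetical helper names (`cluster_inertia`, `centroid`); it is not taken from either codebase.

```python
import math


def cluster_inertia(points, center):
    """Sum of squared distances from each point to a single center."""
    return sum(math.dist(p, center) ** 2 for p in points)


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


# Four points forming a square; the assignment to this cluster is held fixed.
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
mean = centroid(points)

# Move the center 0.01 away from the mean along the x axis.
shifted = (mean[0] + 0.01, mean[1])

base = cluster_inertia(points, mean)
moved = cluster_inertia(points, shifted)

# Predicted change from the identity: n * |shifted - mean|^2.
predicted = len(points) * math.dist(mean, shifted) ** 2

print("center shift:  ", math.dist(mean, shifted))
print("inertia change:", moved - base)
print("n * shift^2:   ", predicted)
```

The center moves by about 0.01 while the inertia moves by about 0.0004, so a threshold on center movement and a threshold on inertia change can disagree about whether an iteration counts as converged.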
