8000 Implement transform method by cannoneyed · Pull Request #6 · PAIR-code/umap-js · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Implement transform method #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 2, 2019
Merged

Implement transform method #6

merged 2 commits into from
May 2, 2019

Conversation

cannoneyed
Copy link
Member
@cannoneyed cannoneyed commented Apr 30, 2019

Implements a transform method that allows for additional points to be projected after the initial fit and projection of the data.

Involves quite a bit of refactoring / additional functionality to handle the infrastructure for transformation. Also introduces a number of new hyperparameters: learningRate, localConnectivity, negativeSampleRate, repulsionStrength, setOpMixRatio, and transformQueueSize.


This change is Reviewable

@cannoneyed
Copy link
Member Author

Just noticing some build/test flakiness that I haven't quite figured out how to deal with (need a Travis token to debug and can't access one for this organization/repo... weird). But building OK now!

Copy link
Collaborator
@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Took a first pass. Most of the comments are tips/suggestions wrt style/lint/setup to align with google styleguides. Also would be great if you can drop the compiled bundles from the repo.

Reviewed 5 of 15 files at r1.
Reviewable status: 5 of 15 files reviewed, 11 unresolved discussions (waiting on @cannoneyed and @dsmilkov)


README.md, line 66 at r1 (raw file):

#### Transforming additional points after fitting

```typescript

fyi, since not all js folks know typescript, but all ts folks know js, would be good to switch the examples/docs to pure js. (We do this in tf.js)

this particular example is also a valid js since there are no typings.


lib/umap-js.js, line 225 at r1 (raw file):

    return max;
}
exports.max2d = max2d;

(optional) usually we don't commit the compiled library to github (you can still make sure it's there before you call npm publish. Would also make the reviews easier. I often do ctrl+f to jump around the PR looking for specific things and it brings me to the bundle.


src/heap.ts, line 152 at r1 (raw file):

/**
 * Push a new element onto the heap. The heap stores potential neighbors
 * for each data point. The ``row`` parameter determines which data point we

tip: single backtick also works.


src/heap.ts, line 157 at r1 (raw file):

 * is to be considered a new addition.
 */
export function uncheckedHeapPush(

npm install clang-format which you can configure with the "google" style, and not worry about formatting again. If you use vscode, you can also install the clang-format extension and enable auto-format upon "saving a file". See https://github.com/tensorflow/tfjs-core/blob/master/.vscode/settings.json#L24


src/heap.ts, line 157 at r1 (raw file):

 * is to be considered a new addition.
 */
export function uncheckedHeapPush(

maybe rename this method to fastHeapPush if speed is the fundamental benefit of having unchecked push.


src/heap.ts, line 164 at r1 (raw file):

  flag: number
): number {
  const indices = heap[0][row];

Let's make heap an interface,

interface Heap {
 indices, weights, isNew
}

this will increase readability. Not for this PR, but consider making it a class and moving the push/pop methods inside the class.


src/heap.ts, line 181 at r1 (raw file):

  let iSwap = 0;
  while (true) {
    const ic1 = 2 * i + 1;

Would be great to add a tslint.json to the repo and install tslint via npm. Reuse this config https://github.com/tensorflow/tfjs-core/blob/master/tslint.json if you want to align with google-style. things like let vs const will be caught and can be auto-fixed by the linter


src/nn_descent.ts, line 177 at r1 (raw file):

        }
        const d = distanceFn(data[indices[j]], queryPoints[i]);
        heap.heapPush(_heap, i, d, indices[j], 1);

since heap is the namespace, it collides with the heap variable. A suggestion: import only the functions from heap that you need, i.e. import {heapPush} from './heap'; -- this style of import is better since it lends itself to tree-shaking.


src/nn_descent.ts, line 182 at r1 (raw file):

  };

  const initFromTree: InitFromTreeFn = (

style comment: prefer named functions, that is function initFromTree () {} instead of anonymous functions that you assign to variables. this is better for debugging purposes (stack trace is cleaner). Also avoid functions inside other functions. Try to keep util-like functions at the top-level with little state about the world.


src/utils.ts, line 158 at r1 (raw file):

/**
 * Generate nSamples many integers from 0 to pool_size such that no

poolSize


src/utils.ts, line 162 at r1 (raw file):

 * rejection sampling.
 */
export function rejectionSample(nSamples: number, poolSize: number): number[] {

prefer Float32Array instead of number[] if you are holding floats.


src/utils.ts, line 187 at r1 (raw file):

 * Reshapes a 1d array into a 2D of given dimensions.
 */
export function reshape2d<T>(x: T[], a: number, b: number): T[][] {

generally avoid keeping data as nested arrays if you are holding more than 100k points. Use flat arrays with flat indexing if the data is non-primitive, otherwise use TypedArrays (Float32Array, Int32Array, Uint8Array) if the data is primitive numbers. You can get large memory savings and speedups.

Copy link
Member Author
@cannoneyed cannoneyed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick review. Happy to implement much of this, but I'd love to chat with you in a bit more detail re: clangformat and tslint (they're just generally not part of my standard repo workflow so I don't have much experience with them)

Reviewable status: 5 of 15 files reviewed, 10 unresolved discussions (waiting on @dsmilkov)


README.md, line 66 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

fyi, since not all js folks know typescript, but all ts folks know js, would be good to switch the examples/docs to pure js. (We do this in tf.js)

this particular example is also a valid js since there are no typings.

Cool, all of the examples code is actually valid javascript - i'll make a note to switch the markdown labels to Javascript later.


lib/umap-js.js, line 225 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

(optional) usually we don't commit the compiled library to github (you can still make sure it's there before you call npm publish. Would also make the reviews easier. I often do ctrl+f to jump around the PR looking for specific things and it brings me to the bundle.

Yeah.... I've actually been trying to decide what the right course of action is here - Including the bundle is a relic of the external tensorboard build process (where it actually downloads a bundle from npm/github). I'll have to make sure this is all square there before removing the lib folder from github. I'm also going to move to include the dist folder (compiled ts) because I've had a number of people want it included for testing out code in feature branches.


src/heap.ts, line 152 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

tip: single backtick also works.

Holdover from the python docs!


src/heap.ts, line 157 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

npm install clang-format which you can configure with the "google" style, and not worry about formatting again. If you use vscode, you can also install the clang-format extension and enable auto-format upon "saving a file". See https://github.com/tensorflow/tfjs-core/blob/master/.vscode/settings.json#L24

I'm using prettier for this repo which is the de facto standard code formatter for JS these days (and I vastly prefer to clang-format). I would prefer to keep using it unless there's any important reason?


src/heap.ts, line 157 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

maybe rename this method to fastHeapPush if speed is the fundamental benefit of having unchecked push.

I'm really just following along with the python reference implementation here. The only big difference between my code and the python implementation is that I use uncheckedHeapPush inside of the normal heapPush function rather than duplicate the code because JS can optimize that second function call better than the python.


src/heap.ts, line 164 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Let's make heap an interface,

interface Heap {
 indices, weights, isNew
}

this will increase readability. Not for this PR, but consider making it a class and moving the push/pop methods inside the class.

Yeah, this is a good idea. For some context here, I've been following the reference python implementation as closely as possible to avoid bugs / logic drift, and I think it's best to keep them as similar as possible. Although this case seems fairly straightforward to understand. I'll make a note to change it in a subsequent PR.


src/heap.ts, line 181 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Would be great to add a tslint.json to the repo and install tslint via npm. Reuse this config https://github.com/tensorflow/tfjs-core/blob/master/tslint.json if you want to align with google-style. things like let vs const will be caught and can be auto-fixed by the linter

I'm cool to add tslint as well, but I feel like it usually doesn't give me that much which is why I opted against it in this repo. Happy to give it a shot in a subsequent PR though! Is there something specific here?


src/nn_descent.ts, line 177 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

since heap is the namespace, it collides with the heap variable. A suggestion: import only the functions from heap that you need, i.e. import {heapPush} from './heap'; -- this style of import is better since it lends itself to tree-shaking.

Yeah, this is another holdover from the tensorboard build system. That's my preferred way of doing business but I switched it up early on when this lib lived in tensorboard (where everything is wacky) and haven't switched back. Will make a change in a subsequent PR.


src/nn_descent.ts, line 182 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

style comment: prefer named functions, that is function initFromTree () {} instead of anonymous functions that you assign to variables. this is better for debugging purposes (stack trace is cleaner). Also avoid functions inside other functions. Try to keep util-like functions at the top-level with little state about the world.

Agreed, but this is a higher-order function factory so it has to work this way. Happy to change to named functions though.


src/utils.ts, line 162 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

prefer Float32Array instead of number[] if you are holding floats.

Yeah I've been wanting to do some work on the numerical side of things for a while now - it's a bit tricky because the algorithm is so complex / there's so many numpy/sklearn features that I'm reimplementing that I've sort of opted for simplicity first with an eye towards optimization later. But if I could do most of my work on Float32 math I could theoretically get some pretty big performance wins. I imagine you have quite a bit of experience here?


src/utils.ts, line 187 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

generally avoid keeping data as nested arrays if you are holding more than 100k points. Use flat arrays with flat indexing if the data is non-primitive, otherwise use TypedArrays (Float32Array, Int32Array, Uint8Array) if the data is primitive numbers. You can get large memory savings and speedups.

Pretty sure this function is only being used for relatively small arrays.

Copy link
Collaborator
@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Most of the comments were FYI for future work. Nice work!

Reviewed 8 of 15 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved


lib/umap-js.js, line 225 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

Yeah.... I've actually been trying to decide what the right course of action is here - Including the bundle is a relic of the external tensorboard build process (where it actually downloads a bundle from npm/github). I'll have to make sure this is all square there before removing the lib folder from github. I'm also going to move to include the dist folder (compiled ts) because I've had a number of people want it included for testing out code in feature branches.

Would be awesome if you can figure that out. If tensorboard bazel fetches from npm that's good news, since you can publish the bundle to npm without commiting it to the git repo.


src/heap.ts, line 157 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

I'm using prettier for this repo which is the de facto standard code formatter for JS these days (and I vastly prefer to clang-format). I would prefer to keep using it unless there's any important reason?

Internally, we are required to use the google style guide. Not sure about external repos. My guess is yes since they are owned by Google. Otherwise why bother publishing and maintaining a style guide: https://github.com/google/gts


src/heap.ts, line 157 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

I'm really just following along with the python reference implementation here. The only big difference between my code and the python implementation is that I use uncheckedHeapPush inside of the normal heapPush function rather than duplicate the code because JS can optimize that second function call better than the python.

SGTM.


src/heap.ts, line 181 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

I'm cool to add tslint as well, but I feel like it usually doesn't give me that much which is why I opted against it in this repo. Happy to give it a shot in a subsequent PR though! Is there something specific here?

I think it might be required since the repo is owned by google (so you want other googlers to ramp up quickly). https://github.com/google/gts is the official tooling that wasn't ready when we released deeplearn.js (most likely wraps clang-format)


src/utils.ts, line 162 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

Yeah I've been wanting to do some work on the numerical side of things for a while now - it's a bit tricky because the algorithm is so complex / there's so many numpy/sklearn features that I'm reimplementing that I've sort of opted for simplicity first with an eye towards optimization later. But if I could do most of my work on Float32 math I could theoretically get some pretty big performance wins. I imagine you have quite a bit of experience here?

Yeah, lots of the numerical code in tf.js and the embedding project users typed arrays - you gets lots of wins, without any increase in code complexity. typedarrays have the same interface as regular flat arrays. Makes sense to tackle this later after you have parity with python.


src/utils.ts, line 187 at r1 (raw file):

Previously, cannoneyed (Andy Coenen) wrote…

Pretty sure this function is only being used for relatively small arrays.

SGTM

@cannoneyed cannoneyed merged commit ff25ac3 into master May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0