Implement transform method #6

cannoneyed · 2019-04-30T00:23:02Z

Implements a transform method that allows for additional points to be projected after the initial fit and projection of the data.

Involves quite a bit of refactoring / additional functionality to handle the infrastructure for transformation. Also introduces a number of new hyperparameters: learningRate, localConnectivity, negativeSampleRate, repulsionStrength, setOpMixRatio, and transformQueueSize.

cannoneyed · 2019-04-30T20:35:27Z

Just noticing some build/test flakiness that I haven't quite figured out how to deal with (need a Travis token to debug and can't access one for this organization/repo... weird). But building OK now!

dsmilkov

Nice work! Took a first pass. Most of the comments are tips/suggestions wrt style/lint/setup to align with google styleguides. Also would be great if you can drop the compiled bundles from the repo.

Reviewed 5 of 15 files at r1.
Reviewable status: 5 of 15 files reviewed, 11 unresolved discussions (waiting on @cannoneyed and @dsmilkov)

README.md, line 66 at r1 (raw file):

#### Transforming additional points after fitting

```typescript

fyi, since not all js folks know typescript, but all ts folks know js, would be good to switch the examples/docs to pure js. (We do this in tf.js)

This particular example is also valid js, since it has no typings.

lib/umap-js.js, line 225 at r1 (raw file):

    return max;
}
exports.max2d = max2d;

(optional) Usually we don't commit the compiled library to github (you can still make sure it's there before you call npm publish). It would also make the reviews easier: I often do ctrl+f to jump around the PR looking for specific things and it brings me to the bundle.

src/heap.ts, line 152 at r1 (raw file):

/**
 * Push a new element onto the heap. The heap stores potential neighbors
 * for each data point. The ``row`` parameter determines which data point we

tip: single backtick also works.

src/heap.ts, line 157 at r1 (raw file):

 * is to be considered a new addition.
 */
export function uncheckedHeapPush(

Try npm install clang-format, which you can configure with the "google" style so you don't have to worry about formatting again. If you use vscode, you can also install the clang-format extension and enable auto-format upon "saving a file". See https://github.com/tensorflow/tfjs-core/blob/master/.vscode/settings.json#L24

src/heap.ts, line 157 at r1 (raw file):

 * is to be considered a new addition.
 */
export function uncheckedHeapPush(

maybe rename this method to fastHeapPush if speed is the fundamental benefit of having unchecked push.

src/heap.ts, line 164 at r1 (raw file):

  flag: number
): number {
  const indices = heap[0][row];

Let's make heap an interface,

interface Heap {
 indices, weights, isNew
}

this will increase readability. Not for this PR, but consider making it a class and moving the push/pop methods inside the class.
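
A minimal sketch of what such an interface could look like. The field names come from the comment above; the element types, the `makeHeap` helper, and the initial fill values (-1, Infinity, 0) are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical shape: field names follow the review comment,
// element types are assumed.
interface Heap {
  indices: number[][]; // indices[row][k]: index of a candidate neighbor
  weights: number[][]; // weights[row][k]: distance to that neighbor
  isNew: number[][];   // isNew[row][k]: flag marking a new addition
}

// Hypothetical helper that allocates an empty heap for nPoints rows,
// each holding up to `size` candidates. Fill values are assumptions.
function makeHeap(nPoints: number, size: number): Heap {
  const fill = (value: number): number[][] =>
    Array.from({ length: nPoints }, () => new Array<number>(size).fill(value));
  return { indices: fill(-1), weights: fill(Infinity), isNew: fill(0) };
}

const heap = makeHeap(3, 2);
// Named access replaces positional access such as heap[0][row].
const rowIndices = heap.indices[0];
```

With named fields, `heap.indices[row]` reads more clearly than `heap[0][row]`, and the compiler can catch a mix-up between the three arrays.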

src/heap.ts, line 181 at r1 (raw file):

  let iSwap = 0;
  while (true) {
    const ic1 = 2 * i + 1;

Would be great to add a tslint.json to the repo and install tslint via npm. Reuse this config https://github.com/tensorflow/tfjs-core/blob/master/tslint.json if you want to align with google-style. Things like let vs const will be caught and can be auto-fixed by the linter.

src/nn_descent.ts, line 177 at r1 (raw file):

        }
        const d = distanceFn(data[indices[j]], queryPoints[i]);
        heap.heapPush(_heap, i, d, indices[j], 1);

since heap is the namespace, it collides with the heap variable. A suggestion: import only the functions from heap that you need, i.e. import {heapPush} from './heap'; -- this style of import is better since it lends itself to tree-shaking.

src/nn_descent.ts, line 182 at r1 (raw file):

  };

  const initFromTree: InitFromTreeFn = (

style comment: prefer named functions, that is function initFromTree () {} instead of anonymous functions that you assign to variables. this is better for debugging purposes (stack trace is cleaner). Also avoid functions inside other functions. Try to keep util-like functions at the top-level with little state about the world.

src/utils.ts, line 158 at r1 (raw file):

/**
 * Generate nSamples many integers from 0 to pool_size such that no

poolSize

src/utils.ts, line 162 at r1 (raw file):

 * rejection sampling.
 */
export function rejectionSample(nSamples: number, poolSize: number): number[] {

prefer Float32Array instead of number[] if you are holding floats.
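
Since this function returns integer indices rather than floats, the analogous typed array would be Int32Array. A rough sketch under stated assumptions: the name `rejectionSampleTyped`, the injectable `random` parameter, and the half-open range [0, poolSize) are all hypothetical and not taken from the PR.

```typescript
// Hypothetical typed-array variant: draws nSamples distinct integers
// from [0, poolSize) by rejecting any candidate that was already drawn.
function rejectionSampleTyped(
  nSamples: number,
  poolSize: number,
  random: () => number = Math.random
): Int32Array {
  if (nSamples > poolSize) {
    throw new Error('nSamples must not exceed poolSize');
  }
  const result = new Int32Array(nSamples);
  for (let i = 0; i < nSamples; i++) {
    let candidate = 0;
    let rejected = true;
    while (rejected) {
      candidate = Math.floor(random() * poolSize);
      rejected = false;
      // Reject the candidate if it matches an earlier sample.
      for (let k = 0; k < i; k++) {
        if (result[k] === candidate) {
          rejected = true;
          break;
        }
      }
    }
    result[i] = candidate;
  }
  return result;
}

const sample = rejectionSampleTyped(5, 10);
```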

src/utils.ts, line 187 at r1 (raw file):

 * Reshapes a 1d array into a 2D of given dimensions.
 */
export function reshape2d<T>(x: T[], a: number, b: number): T[][] {

generally avoid keeping data as nested arrays if you are holding more than 100k points. Use flat arrays with flat indexing if the data is non-primitive, otherwise use TypedArrays (Float32Array, Int32Array, Uint8Array) if the data is primitive numbers. You can get large memory savings and speedups.
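
To illustrate the flat-indexing idea, here is a small sketch assuming row-major layout. The helper names `toFlat` and `at` are made up for this example and are not part of the PR.

```typescript
// Hypothetical helper: copies a nested number[][] into one Float32Array,
// stored row by row (row-major).
function toFlat(rows: number[][], nCols: number): Float32Array {
  const flat = new Float32Array(rows.length * nCols);
  for (let i = 0; i < rows.length; i++) {
    for (let j = 0; j < nCols; j++) {
      flat[i * nCols + j] = rows[i][j];
    }
  }
  return flat;
}

// Hypothetical accessor: element (i, j) lives at offset i * nCols + j.
function at(flat: Float32Array, nCols: number, i: number, j: number): number {
  return flat[i * nCols + j];
}

const flat = toFlat([[1, 2, 3], [4, 5, 6]], 3);
const value = at(flat, 3, 1, 2); // element at row 1, column 2
```

One contiguous buffer avoids a separate array object per row, which is where the memory savings mentioned above would come from.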

cannoneyed

Thanks for the quick review. Happy to implement much of this, but I'd love to chat with you in a bit more detail re: clang-format and tslint (they're just generally not part of my standard repo workflow, so I don't have much experience with them).

Reviewable status: 5 of 15 files reviewed, 10 unresolved discussions (waiting on @dsmilkov)

README.md, line 66 at r1 (raw file):