8000 Perform immut/{set, sparse_array} by FlyCloudC · Pull Request #2272 · moonbitlang/core · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Perform immut/{set, sparse_array} #2272

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jul 1, 2025

Conversation

FlyCloudC
Copy link
Contributor

This PR is the first part of the split #2205, with clearer git history.

  1. clean doc in sparse_array
  2. using Path to avoid passing depth
  3. replace Collision(Bucket[A]) with Leaf(A, @list.T[A])
  4. rename T to Node, remove Node::empty, define T as Node?
  5. replace nested Branch({ data:[Branch(...)], .. }) by Flat(key, path)

Parts from #2205 not included in this PR:

  1. adopt CPS style in add/remove to reuse old nodes in special cases (maybe we don't need this)
  2. optimize set operations: union, intersection, difference
  3. Improve test coverage by using MyInt, which only use the lower 8 bits for hash
  4. apply same improvements to immut/map

Overview

-priv enum Bucket[T] {
-  JustOne(T) // must be non-empty
-  More(T, Bucket[T])
-}

-enum T[A] {
+priv enum Node[A] {
-  Empty

-  Leaf(A)
-  Collision(Bucket[A])
+  Leaf(A, @list.T[A])

+  Flat(A, Path)

-  Branch(@sparse_array.SparseArray[T[A]])
+  Branch(@sparse_array.SparseArray[Node[A]])
}

+ type T[A] Node[A]?

Copy link
peter-jerry-ye-code-review bot commented Jun 14, 2025
Path comparison in Node::contains might cause infinite loop

Category
Correctness
Code Snippet
fn[A : Eq] Node::contains(self : Node[A], key : A, path : Path) -> Bool {
loop (self, path) {
(Leaf(key1, bucket), _) => key == key1 || bucket.contains(key)
(Flat(key1, path1), path) => path == path1 && key == key1
(Branch(children), path) => {
let idx = path.idx()
if children[idx] is Some(child) {
continue (child, path.next())
}
false
}
}
}
Recommendation
Add a path length check to prevent infinite recursion:

(Branch(children), path) => {
  if path.is_last() { return false }
  let idx = path.idx()
  if children[idx] is Some(child) {
    continue (child, path.next())
  }
  false
}

Reasoning
The loop in Node::contains could potentially continue indefinitely if the path structure is malformed. Adding a check for path.is_last() ensures termination.

Unnecessary allocations in union operation

Category
Performance
Code Snippet
fn[K : Eq] T::union(self : T[K], other : T[K]) -> T[K] {
fn go(node1, node2) {
match (node1, node2) {
(node, Flat(key, path)) | (Flat(key, path), node) =>
node.add_with_path(key, path)
Recommendation
Consider reusing nodes when possible:

fn go(node1, node2) {
  match (node1, node2) {
    (node, Flat(key, path)) | (Flat(key, path), node) =>
      if node.contains(key, path) { node } else { node.add_with_path(key, path) }

Reasoning
The current implementation always creates a new node even when the key already exists in the target node. Checking containment first could avoid unnecessary allocations.

Path implementation details are not well documented

Category
Maintainability
Code Snippet
pub(all) type Path UInt derive(Eq)

const SEGMENT_LENGTH : Int = 5
const INDEX_MASK : UInt = (1 << SEGMENT_LENGTH) - 1
const SEGMENT_NUM : Int = 32 / SEGMENT_LENGTH
const HEAD_TAG : UInt = 0xffffffffU << (SEGMENT_LENGTH * SEGMENT_NUM)
Recommendation
Add detailed documentation explaining the Path type and its bit layout:

///| A Path represents a sequence of 5-bit segments in a 32-bit integer.
/// The format is: HEAD_TAG | segment_n | ... | segment_1 | segment_0
/// where each segment is 5 bits and HEAD_TAG ensures the upper bits are set.
pub(all) type Path UInt derive(Eq)

Reasoning
The Path type is central to the new implementation but its internal representation and constraints are not clearly documented. Better documentation would help maintainers understand the design.

@coveralls
Copy link
Collaborator
coveralls commented Jun 14, 2025

Pull Request Test Coverage Report for Build 166

Details

  • 189 of 220 (85.91%) changed or added relevant lines in 5 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 89.243%

Changes Missing Coverage Covered Lines Changed/Added Lines %
immut/internal/sparse_array/sparse_array.mbt 49 56 87.5%
immut/hashset/HAMT.mbt 117 141 82.98%
Files with Coverage Reduction New Missed Lines %
immut/hashset/HAMT.mbt 1 85.03%
immut/internal/sparse_array/sparse_array.mbt 1 89.04%
Totals Coverage Status
Change from base Build 163: -0.1%
Covered Lines: 3476
Relevant Lines: 3895

💛 - Coveralls

@bobzhang bobzhang requested a review from Guest0x0 June 15, 2025 01:14
Copy link
Collaborator
@peter-jerry-ye peter-jerry-ye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it to the experts for the implementation of the algorithm, but here's something I've observed.

@Guest0x0
Copy link
Contributor
8000

btw from #2205 the optimized set operations seem quite independent, I think we should include that part in this PR too, so that there would be no performance degeneration in those operations

@FlyCloudC FlyCloudC force-pushed the perf-immut-set branch 2 times, most recently from 3d148b8 to 1c87007 Compare June 21, 2025 01:33
const SEGMENT_NUM : Int = 32 / SEGMENT_LENGTH

///|
const HEAD_TAG : UInt = 0xffffffffU << (SEGMENT_LENGTH * SEGMENT_NUM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path::of(x) always has the form 0b11_xxxxx_xxxxx_xxxxx_xxxxx_xxxxx.

  • Path::of(x).next().next().next() is 0b11_xxxxx_xxxxx, not satisfy <= 0b11_11111
  • Path::of(x).next().next().next().next() is 0b11_xxxxx, satisfy <= 0b11_11111

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, the new version only utilize the lower 30 bits of the hash value, and the highest 2 bits are used to identify end of hash value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, the new version only utilize the lower 30 bits of the hash value, and the highest 2 bits are used to identify end of hash value?

Yes

}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note: maybe it's possible to also do path compression for Branch (i.e. add an extra path segment (Path + length) parameter) too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note: maybe it's possible to also do path compression for Branch (i.e. add an extra path segment (Path + length) parameter) too?

I think the benefit might be limited. With a good enough hash function, for k in 1~5, the chance of n different elements all have the same kth Path segment is about 1/(32^(n-1)) (assuming 5 bits per segment).

(Collision(xs), Collision(ys)) =>
xs.size() == ys.size() && xs.iter().all(fn(x) { ys.find(x) })
(Leaf(x, xs), Leaf(y, ys)) =>
xs.length() == ys.length() && xs.add(x).iter().all(ys.add(y).contains(_))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ys.add(y) will be run multiple times here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the expected behavior for me. I thought f(exp, _) would desugar to

{
  let v1 = exp
  hole => f(v1, hole)
}

I'll fix it. And I suggest documenting this behavior in the language's documentation.

@FlyCloudC FlyCloudC force-pushed the perf-immut-set branch 2 times, most recently from 558deab to ca818e3 Compare June 27, 2025 03:44
@FlyCloudC
Copy link
Contributor Author

(using arrow function)

Copy link
Contributor
@Guest0x0 Guest0x0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR overall looks good to me, remaining TODO:

  • accidental modification of @immut/hashmap below
  • add more comments to the @path package, explaining how the length of path is represented

m.add(kv.0, kv.1)
} else {
m
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR should only affect immut/hashset, is this accidentally introduced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not accidental. Since the types of SparseArray::difference and SparseArray::intersection were changed

-fn[X] SparseArray::difference(Self[X], Self[X]) -> Self[X]
+fn[X] SparseArray::difference(Self[X], Self[X], (X, X) -> X?) -> Self[X]?

-fn[X] SparseArray::intersection(Self[X], Self[X], (X, X) -> X raise?) -> Self[X] raise?
+fn[X] SparseArray::intersection(Self[X], Self[X], (X, X) -> X? raise?) -> Self[X]? raise?

the corresponding calls in immut/hashmap were also affected.

@FlyCloudC
Copy link
Contributor Author
  • add more comments to the @path package, explaining how the length of path is represented

Done.

@Guest0x0
Copy link
Contributor
Guest0x0 commented Jul 1, 2025

cleanup the @immut/hashmap change and this PR should be good to go

@FlyCloudC
Copy link
Contributor Author
FlyCloudC commented Jul 1, 2025

cleanup the @immut/hashmap change and this PR should be good to go

How should I clean up? I replied above, this is not an accidental change.

@Guest0x0
Copy link
Contributor
Guest0x0 commented Jul 1, 2025

How should I clean up? I replied above, this is not an accidental change.

I cannot see your comment. If it is under my review comment, is it still pending and not submitted?

@FlyCloudC
Copy link
Contributor Author
FlyCloudC commented Jul 1, 2025

I cannot see your comment. If it is under my review comment, is it still pending and not submitted?

Copy here:

Not accidental. Since the types of SparseArray::difference and SparseArray::intersection were changed

-fn[X] SparseArray::difference(Self[X], Self[X]) -> Self[X]
+fn[X] SparseArray::difference(Self[X], Self[X], (X, X) -> X?) -> Self[X]?

-fn[X] SparseArray::intersection(Self[X], Self[X], (X, X) -> X raise?) -> Self[X] raise?
+fn[X] SparseArray::intersection(Self[X], Self[X], (X, X) -> X? raise?) -> Self[X]? raise?

the corresponding calls in immut/hashmap were also affected.


It seems I still need to learn how to use GitHub, XD


OK, now I know how to handle it. This "pending review" design is seriously user-unfriendly

@Guest0x0 Guest0x0 merged commit efffcdc into moonbitlang:main Jul 1, 2025
10 checks passed
@FlyCloudC FlyCloudC deleted the perf-immut-set branch July 2, 2025 10:48
@FlyCloudC FlyCloudC mentioned this pull request Jul 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0