8000 Refactor immut/{set, map, sparse_array} by FlyCloudC · Pull Request #2205 · moonbitlang/core · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Refactor immut/{set, map, sparse_array} #2205

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

FlyCloudC
Copy link
Contributor
@FlyCloudC FlyCloudC commented Jun 4, 2025

Overview

I'm planning to refactor the immut/{set, map, sparse_array} modules. Due to the extensive changes, I've temporarily placed the new implementation in a separate repository: https://github.com/FlyCloudC/HAMT

Bug Fixes

  1. Singleton Leaf Issue​​
    • Fixed incorrect top-level Leaf creation in singleton, which previously forced subsequent add operations to degenerate into Collision nodes (causing O(n) performance).
  2. Difference Operation​​

Improvement

  1. Path Compression​​
    • Introduced Flat[K, Path] to eliminate redundant single-leaf Branch nesting, reducing depth and improving lookup/insertion.
  2. Bucket Consolidation​​
    • Replaced Bucket with @list to deduplicate method implementations.
    • Removed the previous optimization for single-element leaves (Leaf[K]). Now using Leaf(K, List[K]) to represent non-empty lists. This adds an extra @list.Empty for each bottom-level node (depth=5), but since most leaves will be represented as Flat (without hash collisions), the overhead is acceptable
  3. Internal inplace
    • Added internal add_inplace to reduce memory usage during from_array and from_iter

Comapre

brfore:

///|
/// A bucket is a non-empty linked list of key-value pair,
/// used to resolve hash collision in HAMT
priv enum Bucket[T] {
  JustOne(T) // must be non-empty
  More(T, Bucket[T])
}

///|
/// An immutable hash set data structure
enum T[A] {
  Empty
  Leaf(A) // optimize for the case of no collision
  Collision(Bucket[A]) // use a list of buckets to resolve collision
  Branch(@sparse_array.SparseArray[T[A]])
}

after:

///|
/// Represents a non-empty set
/// 
/// Invariance:
/// - If it can be `Flat`, it must be `Flat`
/// - `Branch` nesting depth <= 5
priv enum Node[A] {
  Flat(A, Path)
  Leaf(A, @list.T[A])
  Branch(@sparse_array.T[Node[A]])
} derive(ToJson)

///|
type T[A] Node[A]? derive(Eq)

Pending Work

  • Implementation of difference operations
  • immut/map module

bench

fn go(b : @bench.T, size : Int) -> Unit {
  let rs = @quickcheck/splitmix.new()
  let ar : Array[Int] = Array::arbitrary(size, rs)
  let br : Array[Int] = Array::arbitrary(size, rs)
  let a1 = from_array(ar).add(2).remove(3)
  let a2 = @immut/hashset.from_array(ar).add(2).remove(3)
  let b1 = from_array(br)
  let b2 = @immut/hashset.from_array(br)
  b.bench(name="new union", fn() { b.keep(a1.union(b1)) })
  b.bench(name="old union", fn() { b.keep(a2.union(b2)) })
  b.bench(name="new add 2", fn() { b.keep(a1.add(2)) })
  b.bench(name="old add 2", fn() { b.keep(a2.add(2)) })
  b.bench(name="new add x", fn() { b.keep(a1.add(3)) })
  b.bench(name="old add x", fn() { b.keep(a2.add(3)) })
  b.bench(name="new rmv 2", fn() { b.keep(a1.remove(2)) })
  b.bench(name="old rmv 2", fn() { b.keep(a2.remove(2)) })
  b.bench(name="new rmv x", fn() { b.keep(a1.remove(3)) })
  b.bench(name="old rmv x", fn() { b.keep(a2.remove(3)) })
  b.bench(name="new from_arr", fn() { b.keep(from_array(ar)) })
  b.bench(name="old from_arr", fn() { b.keep(@immut/hashset.from_array(ar)) })
  b.bench(name="new iter", fn() { b.keep(a1.iter().count()) })
  b.bench(name="old iter", fn() { b.keep(a2.iter().count()) })
}

test "big" (b : @bench.T) {
  go(b, 10000)
}

test "small" (b : @bench.T) {
  go(b, 42)
}
target wasm-gc

bench FlyCloudC/hamt/immut_set/bench.mbt::big
name         time (mean ± σ)         range (min … max)
new union      36.59 µs ±   1.02 µs    35.57 µs …  38.72 µs  in 10 ×   2830 runs
old union     533.49 µs ± 143.71 µs   403.66 µs … 822.63 µs  in 10 ×    122 runs
new add 2       0.03 µs ±   0.00 µs     0.03 µs …   0.03 µs  in 10 × 100000 runs
old add 2       0.22 µs ±   0.00 µs     0.22 µs …   0.23 µs  in 10 × 100000 runs
new add x       0.20 µs ±   0.04 µs     0.17 µs …   0.29 µs  in 10 × 100000 runs
old add x       0.21 µs ±   0.00 µs     0.21 µs …   0.22 µs  in 10 × 100000 runs
new rmv 2       0.18 µs ±   0.00 µs     0.18 µs …   0.19 µs  in 10 × 100000 runs
old rmv 2       0.17 µs ±   0.00 µs     0.17 µs …   0.18 µs  in 10 × 100000 runs
new rmv x       0.03 µs ±   0.00 µs     0.03 µs …   0.03 µs  in 10 × 100000 runs
old rmv x       0.16 µs ±   0.00 µs     0.15 µs …   0.17 µs  in 10 × 100000 runs
new from_arr  602.78 µs ±   9.48 µs   586.60 µs … 615.53 µs  in 10 ×    160 runs
old from_arr    4.59 ms ± 130.22 µs     4.42 ms …   4.78 ms  in 10 ×     22 runs
new iter       40.72 µs ±   0.19 µs    40.49 µs …  40.99 µs  in 10 ×   2289 runs
old iter      896.34 µs ±  67.34 µs   738.98 µs … 979.46 µs  in 10 ×     94 runs

bench FlyCloudC/hamt/immut_set/bench.mbt::small
name         time (mean ± σ)         range (min … max)
new union       0.10 µs ±   0.00 µs     0.10 µs …   0.11 µs  in 10 × 100000 runs
old union       0.68 µs ±   0.01 µs     0.67 µs …   0.69 µs  in 10 × 100000 runs
new add 2       0.02 µs ±   0.00 µs     0.02 µs …   0.02 µs  in 10 × 100000 runs
old add 2       0.13 µs ±   0.00 µs     0.12 µs …   0.13 µs  in 10 × 100000 runs
new add x       0.04 µs ±   0.00 µs     0.03 µs …   0.04 µs  in 10 × 100000 runs
old add x       0.11 µs ±   0.00 µs     0.11 µs …   0.12 µs  in 10 × 100000 runs
new rmv 2       0.04 µs ±   0.00 µs     0.04 µs …   0.05 µs  in 10 × 100000 runs
old rmv 2       0.06 µs ±   0.00 µs     0.06 µs …   0.06 µs  in 10 × 100000 runs
new rmv x       0.02 µs ±   0.00 µs     0.02 µs …   0.02 µs  in 10 × 100000 runs
old rmv x       0.01 µs ±   0.00 µs     0.01 µs …   0.02 µs  in 10 × 100000 runs
new from_arr    0.09 µs ±   0.00 µs     0.09 µs …   0.10 µs  in 10 × 100000 runs
old from_arr    0.69 µs ±   0.01 µs     0.67 µs …   0.70 µs  in 10 × 100000 runs
new iter        0.04 µs ±   0.00 µs     0.04 µs …   0.05 µs  in 10 × 100000 runs
old iter        0.12 µs ±   0.00 µs     0.12 µs …   0.13 µs  in 10 × 100000 runs

@coveralls
Copy link
Collaborator
coveralls commented Jun 4, 2025

Pull Request Test Coverage Report for Build 7139

Details

  • 272 of 283 (96.11%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 93.481%

Changes Missing Coverage Covered Lines Changed/Added Lines %
immut/internal/sparse_array/sparse_array.mbt 69 70 98.57%
immut/hashset/hashset.mbt 81 85 95.29%
immut/hashset/node.mbt 93 99 93.94%
Totals Coverage Status
Change from base Build 7124: 0.3%
Covered Lines: 8661
Relevant Lines: 9265

💛 - Coveralls

@FlyCloudC
Copy link
Contributor Author
FlyCloudC commented Jun 5, 2025

Although we could use List[K] for @immut/set, we lack a dual-element List[K, V] equivalent for @immut/map. Using List[(K, V)] would introduce an extra unboxing overhead. Fortunately, in most cases(size < 1500000), Leaf nodes account for no more than 6% of cases (see tests).

test {
  let rs = @quickcheck/splitmix.new()
  fn f(size) {
    let a : T[Int] = Arbitrary::arbitrary(size, rs)
    let d = [0, 0, 0, 0, 0, 0, 0]
    a.each(fn(x) { d[a.depth_of(x)] += 1 })
    (a.size(), d)
  }

  inspect(f(100), content="(44, [0, 9, 33, 2, 0, 0, 0])")
  inspect(f(1000), content="(31, [0, 17, 14, 0, 0, 0, 0])")
  inspect(f(10000), content="(4023, [0, 0, 71, 3474, 464, 14, 0])")
  inspect(f(100000), content="(28305, [0, 0, 0, 11785, 15720, 776, 24])")
  inspect(f(1000000), content="(322608, [0, 0, 0, 13, 237231, 82405, 2959])")
  inspect(f(2000000), content="(1149207, [0, 0, 0, 0, 383947, 727437, 37823])")
  inspect(f(4000000), content="(1513468, [0, 0, 0, 0, 357127, 1089929, 66412])")
}

Copy link
peter-jerry-ye-code-review bot commented Jun 7, 2025
The new implementation could better document the hash collision handling behavior

Category
Maintainability
Code Snippet
priv enum Node[A] {
Flat(A, Path)
Leaf(A, List[A])
Branch(SpArr[Node[A]])
}
Recommendation
Add documentation explaining how hash collisions are handled in Leaf nodes and what are invariants around Path
Reasoning
The code handling hash collisions is critical for correctness and performance. While the PR improves the implementation, the documentation could better explain how different Node variants handle collisions and what invariants need to be maintained.

union/intersection/difference operations could be optimized with early exits

Category
Performance
Code Snippet
pub fn[A : Eq] T::union(self : T[A], other : T[A]) -> T[A] {
match (self., other.) {
(None, x) | (x, None) => x
(Some(a), Some(b)) => Some(go(a, b))
}
}
Recommendation
Add fast paths for common cases like when sets are disjoint or one is subset of another by checking sizes first
Reasoning
The current implementation always traverses both trees even in cases where we could determine the result faster by comparing sizes or checking disjointness first. This optimization would improve performance for common cases.

Potential missing error handling in Path operations

Category
Correctness
Code Snippet
pub fn[A : Hash] T::of(key : A) -> T {
key.hash().reinterpret_as_uint() | HEAD_TAG
}
Recommendation
Add checks to ensure the hash value fits within expected bit ranges and handle edge cases
Reasoning
The Path implementation relies on bit manipulation of hash values but doesn't appear to validate the inputs. This could cause issues with certain hash values that don't meet the bit length assumptions.

@FlyCloudC
Copy link
Contributor Author

I've completed porting immut/hashset from my temp repo. Before proceeding with immut/hashmap, I'd like to get review feedback to avoid duplicate fixes.

@FlyCloudC FlyCloudC marked this pull request as ready for review June 8, 2025 20:47
@hackwaly hackwaly requested a review from Guest0x0 June 10, 2025 06:36
@peter-jerry-ye
Copy link
Collaborator

Could you please split this PR into smaller patches so that we can review more easily? Thank you.

@FlyCloudC FlyCloudC deleted the refactor-hamt branch July 2, 2025 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0