Refactor immut/{set, map, sparse_array} #2205

FlyCloudC · 2025-06-04T22:28:41Z

Overview

I'm planning to refactor the immut/{set, map, sparse_array} modules. Due to the extensive changes, I've temporarily placed the new implementation in a separate repository: https://github.com/FlyCloudC/HAMT

Bug Fixes

Singleton Leaf Issue
- Fixed incorrect top-level Leaf creation in singleton, which previously forced subsequent add operations to degenerate into Collision nodes (causing O(n) performance).
Difference Operation
- Resolved bugs in difference (originally reported in Add set operations to @immut/hash{map, set} and @internal/sparse_array Summary #2145).

Improvement

Path Compression
- Introduced Flat[K, Path] to eliminate redundant single-leaf Branch nesting, reducing depth and improving lookup/insertion.
Bucket Consolidation
- Replaced Bucket with @list to deduplicate method implementations.
- Removed the previous optimization for single-element leaves (Leaf[K]). Now using Leaf(K, List[K]) to represent non-empty lists. This adds an extra @list.Empty for each bottom-level node (depth=5), but since most leaves will be represented as Flat (without hash collisions), the overhead is acceptable
Internal inplace
- Added internal add_inplace to reduce memory usage during from_array and from_iter

Comapre

brfore:

///|
/// A bucket is a non-empty linked list of key-value pair,
/// used to resolve hash collision in HAMT
priv enum Bucket[T] {
  JustOne(T) // must be non-empty
  More(T, Bucket[T])
}

///|
/// An immutable hash set data structure
enum T[A] {
  Empty
  Leaf(A) // optimize for the case of no collision
  Collision(Bucket[A]) // use a list of buckets to resolve collision
  Branch(@sparse_array.SparseArray[T[A]])
}

after:

///|
/// Represents a non-empty set
/// 
/// Invariance:
/// - If it can be `Flat`, it must be `Flat`
/// - `Branch` nesting depth <= 5
priv enum Node[A] {
  Flat(A, Path)
  Leaf(A, @list.T[A])
  Branch(@sparse_array.T[Node[A]])
} derive(ToJson)

///|
type T[A] Node[A]? derive(Eq)

Pending Work

Implementation of difference operations
immut/map module

bench

fn go(b : @bench.T, size : Int) -> Unit {
  let rs = @quickcheck/splitmix.new()
  let ar : Array[Int] = Array::arbitrary(size, rs)
  let br : Array[Int] = Array::arbitrary(size, rs)
  let a1 = from_array(ar).add(2).remove(3)
  let a2 = @immut/hashset.from_array(ar).add(2).remove(3)
  let b1 = from_array(br)
  let b2 = @immut/hashset.from_array(br)
  b.bench(name="new union", fn() { b.keep(a1.union(b1)) })
  b.bench(name="old union", fn() { b.keep(a2.union(b2)) })
  b.bench(name="new add 2", fn() { b.keep(a1.add(2)) })
  b.bench(name="old add 2", fn() { b.keep(a2.add(2)) })
  b.bench(name="new add x", fn() { b.keep(a1.add(3)) })
  b.bench(name="old add x", fn() { b.keep(a2.add(3)) })
  b.bench(name="new rmv 2", fn() { b.keep(a1.remove(2)) })
  b.bench(name="old rmv 2", fn() { b.keep(a2.remove(2)) })
  b.bench(name="new rmv x", fn() { b.keep(a1.remove(3)) })
  b.bench(name="old rmv x", fn() { b.keep(a2.remove(3)) })
  b.bench(name="new from_arr", fn() { b.keep(from_array(ar)) })
  b.bench(name="old from_arr", fn() { b.keep(@immut/hashset.from_array(ar)) })
  b.bench(name="new iter", fn() { b.keep(a1.iter().count()) })
  b.bench(name="old iter", fn() { b.keep(a2.iter().count()) })
}

test "big" (b : @bench.T) {
  go(b, 10000)
}

test "small" (b : @bench.T) {
  go(b, 42)
}

target wasm-gc

bench FlyCloudC/hamt/immut_set/bench.mbt::big
name         time (mean ± σ)         range (min … max)
new union      36.59 µs ±   1.02 µs    35.57 µs …  38.72 µs  in 10 ×   2830 runs
old union     533.49 µs ± 143.71 µs   403.66 µs … 822.63 µs  in 10 ×    122 runs
new add 2       0.03 µs ±   0.00 µs     0.03 µs …   0.03 µs  in 10 × 100000 runs
old add 2       0.22 µs ±   0.00 µs     0.22 µs …   0.23 µs  in 10 × 100000 runs
new add x       0.20 µs ±   0.04 µs     0.17 µs …   0.29 µs  in 10 × 100000 runs
old add x       0.21 µs ±   0.00 µs     0.21 µs …   0.22 µs  in 10 × 100000 runs
new rmv 2       0.18 µs ±   0.00 µs     0.18 µs …   0.19 µs  in 10 × 100000 runs
old rmv 2       0.17 µs ±   0.00 µs     0.17 µs …   0.18 µs  in 10 × 100000 runs
new rmv x       0.03 µs ±   0.00 µs     0.03 µs …   0.03 µs  in 10 × 100000 runs
old rmv x       0.16 µs ±   0.00 µs     0.15 µs …   0.17 µs  in 10 × 100000 runs
new from_arr  602.78 µs ±   9.48 µs   586.60 µs … 615.53 µs  in 10 ×    160 runs
old from_arr    4.59 ms ± 130.22 µs     4.42 ms …   4.78 ms  in 10 ×     22 runs
new iter       40.72 µs ±   0.19 µs    40.49 µs …  40.99 µs  in 10 ×   2289 runs
old iter      896.34 µs ±  67.34 µs   738.98 µs … 979.46 µs  in 10 ×     94 runs

bench FlyCloudC/hamt/immut_set/bench.mbt::small
name         time (mean ± σ)         range (min … max)
new union       0.10 µs ±   0.00 µs     0.10 µs …   0.11 µs  in 10 × 100000 runs
old union       0.68 µs ±   0.01 µs     0.67 µs …   0.69 µs  in 10 × 100000 runs
new add 2       0.02 µs ±   0.00 µs     0.02 µs …   0.02 µs  in 10 × 100000 runs
old add 2       0.13 µs ±   0.00 µs     0.12 µs …   0.13 µs  in 10 × 100000 runs
new add x       0.04 µs ±   0.00 µs     0.03 µs …   0.04 µs  in 10 × 100000 runs
old add x       0.11 µs ±   0.00 µs     0.11 µs …   0.12 µs  in 10 × 100000 runs
new rmv 2       0.04 µs ±   0.00 µs     0.04 µs …   0.05 µs  in 10 × 100000 runs
old rmv 2       0.06 µs ±   0.00 µs     0.06 µs …   0.06 µs  in 10 × 100000 runs
new rmv x       0.02 µs ±   0.00 µs     0.02 µs …   0.02 µs  in 10 × 100000 runs
old rmv x       0.01 µs ±   0.00 µs     0.01 µs …   0.02 µs  in 10 × 100000 runs
new from_arr    0.09 µs ±   0.00 µs     0.09 µs …   0.10 µs  in 10 × 100000 runs
old from_arr    0.69 µs ±   0.01 µs     0.67 µs …   0.70 µs  in 10 × 100000 runs
new iter        0.04 µs ±   0.00 µs     0.04 µs …   0.05 µs  in 10 × 100000 runs
old iter        0.12 µs ±   0.00 µs     0.12 µs …   0.13 µs  in 10 × 100000 runs

coveralls · 2025-06-04T22:31:18Z

Pull Request Test Coverage Report for Build 7139

Details

272 of 283 (96.11%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.3%) to 93.481%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
immut/internal/sparse_array/sparse_array.mbt	69	70	98.57%
immut/hashset/hashset.mbt	81	85	95.29%
immut/hashset/node.mbt	93	99	93.94%

Totals
Change from base Build 7124:	0.3%
Covered Lines:	8661
Relevant Lines:	9265

💛 - Coveralls

FlyCloudC · 2025-06-05T13:19:10Z

Although we could use List[K] for @immut/set, we lack a dual-element List[K, V] equivalent for @immut/map. Using List[(K, V)] would introduce an extra unboxing overhead. Fortunately, in most cases(size < 1500000), Leaf nodes account for no more than 6% of cases (see tests).

test {
  let rs = @quickcheck/splitmix.new()
  fn f(size) {
    let a : T[Int] = Arbitrary::arbitrary(size, rs)
    let d = [0, 0, 0, 0, 0, 0, 0]
    a.each(fn(x) { d[a.depth_of(x)] += 1 })
    (a.size(), d)
  }

  inspect(f(100), content="(44, [0, 9, 33, 2, 0, 0, 0])")
  inspect(f(1000), content="(31, [0, 17, 14, 0, 0, 0, 0])")
  inspect(f(10000), content="(4023, [0, 0, 71, 3474, 464, 14, 0])")
  inspect(f(100000), content="(28305, [0, 0, 0, 11785, 15720, 776, 24])")
  inspect(f(1000000), content="(322608, [0, 0, 0, 13, 237231, 82405, 2959])")
  inspect(f(2000000), content="(1149207, [0, 0, 0, 0, 383947, 727437, 37823])")
  inspect(f(4000000), content="(1513468, [0, 0, 0, 0, 357127, 1089929, 66412])")
}

peter-jerry-ye-code-review · 2025-06-07T15:49:10Z

The new implementation could better document the hash collision handling behavior

Category
Maintainability
Code Snippet
priv enum Node[A] {
Flat(A, Path)
Leaf(A, List[A])
Branch(SpArr[Node[A]])
}
Recommendation
Add documentation explaining how hash collisions are handled in Leaf nodes and what are invariants around Path
Reasoning
The code handling hash collisions is critical for correctness and performance. While the PR improves the implementation, the documentation could better explain how different Node variants handle collisions and what invariants need to be maintained.

union/intersection/difference operations could be optimized with early exits

Category
Performance
Code Snippet
pub fn[A : Eq] T::union(self : T[A], other : T[A]) -> T[A] {
match (self., other.) {
(None, x) | (x, None) => x
(Some(a), Some(b)) => Some(go(a, b))
}
}
Recommendation
Add fast paths for common cases like when sets are disjoint or one is subset of another by checking sizes first
Reasoning
The current implementation always traverses both trees even in cases where we could determine the result faster by comparing sizes or checking disjointness first. This optimization would improve performance for common cases.

Potential missing error handling in Path operations

Category
Correctness
Code Snippet
pub fn[A : Hash] T::of(key : A) -> T {
key.hash().reinterpret_as_uint() | HEAD_TAG
}
Recommendation
Add checks to ensure the hash value fits within expected bit ranges and handle edge cases
Reasoning
The Path implementation relies on bit manipulation of hash values but doesn't appear to validate the inputs. This could cause issues with certain hash values that don't meet the bit length assumptions.

FlyCloudC · 2025-06-07T16:24:56Z

I've completed porting immut/hashset from my temp repo. Before proceeding with immut/hashmap, I'd like to get review feedback to avoid duplicate fixes.

immut/hashset/hashset.mbti

…p unused tests

peter-jerry-ye · 2025-06-13T08:39:11Z

Could you please split this PR into smaller patches so that we can review more easily? Thank you.

FlyCloudC mentioned this pull request Jun 6, 2025

Add map and map_with_key Methods to Map #2210

Closed

FlyCloudC force-pushed the refactor-hamt branch from c8c422a to b01e828 Compare June 7, 2025 15:48

FlyCloudC force-pushed the refactor-hamt branch from b01e828 to 391341b Compare June 7, 2025 16:10

FlyCloudC force-pushed the refactor-hamt branch from a94648d to c138b32 Compare June 8, 2025 18:56

FlyCloudC marked this pull request as ready for review June 8, 2025 20:47

FlyCloudC force-pushed the refactor-hamt branch from c138b32 to 8a353fe Compare June 8, 2025 21:12

FlyCloudC commented Jun 8, 2025

View reviewed changes

immut/hashset/hashset.mbti Show resolved Hide resolved

hackwaly requested a review from Guest0x0 June 10, 2025 06:36

FlyCloudC added 14 commits June 11, 2025 18:06

refactor(Bitset): remove public visibility from methods and clean u…

d762403

…p unused tests

refactor(SparseArray): rename struct and update methods

714e914

update test

e501d90

add Path pkg

5540309

perf(immut/hashset): implement path compression

a0280bb

refactor(SparseArray): remove unused first_value method

0a4934c

reorder trait constraints

13105b9

refactor: replace method with function

6a22cfb

moon info

050a4fe

fix: temporary workaround

94e255b

add license header

75174c8

tweak

986d430

test(immut/hashset): improve coverage

db7a7df

remove usage of loop with multi param

e6e74c2

FlyCloudC force-pushed the refactor-hamt branch from 8a353fe to e6e74c2 Compare June 11, 2025 10:17

FlyCloudC mentioned this pull request Jun 14, 2025

Perform immut/{set, sparse_array} #2272

Merged

FlyCloudC closed this Jun 21, 2025

FlyCloudC deleted the refactor-hamt branch July 2, 2025 10:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor immut/{set, map, sparse_array} #2205

Refactor immut/{set, map, sparse_array} #2205

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Refactor immut/{set, map, sparse_array} #2205

Refactor immut/{set, map, sparse_array} #2205

Uh oh!

Conversation

Uh oh!

Overview

Bug Fixes

Improvement

Comapre

Pending Work

bench

Uh oh!

Uh oh!

Pull Request Test Coverage Report for Build 7139

Details

💛 - Coveralls

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!