Releases
v0.24.0
Highlights
Much faster fused attention with support for causal masking, benchmarks
Improvements in prompt processing speed and memory use, benchmarks
Much faster small batch fused attention for e.g. speculative decoding, benchmarks
Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
Support fused masking in scaled_dot_product_attention (see the sketch after this list)
Support transposed head/seq for fused vector scaled_dot_product_attention
SDPA support for small batch (over sequence) queries
Enable fused attention for head dim 128
Redesign CPU back-end for faster CPU/GPU synchronization
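A minimal sketch of the fused attention path with a causal mask. Passing the string "causal" as the mask and the illustrative shapes below are assumptions, not confirmed details of the release:

```python
import mlx.core as mx

# Illustrative shapes: (batch, heads, seq, head_dim); head_dim 128 now hits the fused kernel.
q = mx.random.normal((1, 8, 256, 128))
k = mx.random.normal((1, 8, 256, 128))
v = mx.random.normal((1, 8, 256, 128))

# Passing the string "causal" is an assumption about how fused masking is
# exposed; an explicit additive mask array should also work.
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=1.0 / 128**0.5, mask="causal"
)
mx.eval(out)
```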
Features
Allow debugging in distributed mode
Support mx.fast.rms_norm without scale (see the sketches after this list)
Add nuclear norm support in mx.linalg.norm
Add XOR on arrays
Add mlx::core::version()
Allow non-square lu in mx.linalg.lu
Double precision for LAPACK ops (eigh, svd, etc.)
Add a prepare tb (Thunderbolt) ring script
Docs for the ring distributed back-end
Affine quantization always in fp32
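A few hedged usage sketches for the features above. The exact spellings (weight=None for the unscaled rms_norm, the NumPy-style ord="nuc" for the nuclear norm, and running the linear-algebra ops on the CPU stream) are assumptions rather than confirmed signatures:

```python
import mlx.core as mx

x = mx.random.normal((4, 64))
# rms_norm without a learned scale: passing None for the weight is an assumption.
y = mx.fast.rms_norm(x, None, 1e-5)

a = mx.random.normal((8, 8))
# Nuclear norm, assuming the NumPy-style ord="nuc" spelling; linear-algebra
# ops typically run on the CPU stream.
nuc = mx.linalg.norm(a, ord="nuc", stream=mx.cpu)

# XOR on integer arrays (shown here via mx.bitwise_xor).
bits = mx.bitwise_xor(mx.array([0b1010]), mx.array([0b0110]))

# LU factorization of a non-square matrix, also on the CPU stream.
rect = mx.random.normal((6, 4))
factors = mx.linalg.lu(rect, stream=mx.cpu)
```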
Optimizers
Add a multi-optimizer, optimizers.MultiOptimizer (see the sketch below)
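A sketch of the new MultiOptimizer, assuming it takes a list of optimizers plus per-optimizer filters over parameter paths, with the last optimizer catching anything the filters do not match; check the optimizers docs for the exact filter signature:

```python
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Linear(16, 16)

# Assumed behavior: parameters whose path matches the filter go to SGD,
# everything else falls through to Adam. The (path, parameter) filter
# signature is an assumption, not a confirmed API.
opt = optim.MultiOptimizer(
    [optim.SGD(learning_rate=1e-2), optim.Adam(learning_rate=1e-3)],
    [lambda path, p: "bias" in path],
)
# opt.update(model, grads) is then used as with any other optimizer.
```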
Bug Fixes
Do not define MLX_VERSION globally
Reduce binary size after the fast-synchronization changes
Fix vmap for flatten
Fix copy for large arrays with JIT
Fix grad with inplace updates
Use the same accumulation precision in gemv as in gemm
Fix slice data size
Use a heap for small sizes
Fix donation in scan
Ensure linspace always contains start and stop (see the check after this list)
Raise an exception in the rope op if input is integer
Limit compile buffers by
Fix mx.float64 type promotion
Fix CPU SIMD erf_inv
Update smooth_l1_loss in losses.
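A quick check of the linspace endpoint fix; the printed values are what one would expect, not output copied from a run:

```python
import mlx.core as mx

# linspace should now always include both endpoints.
x = mx.linspace(0.0, 1.0, num=5)
print(x)  # expected: [0.0, 0.25, 0.5, 0.75, 1.0]
```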