Releases · ml-explore/mlx
v0.25.2
v0.25.1
v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark; see the sketch after this list)
- Depthwise separable convolutions, up to 4x faster than PyTorch (benchmark)
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
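
The custom logsumexp surfaces through the existing `mx.logsumexp` op. A minimal sketch of the memory-heavy pattern it targets, a log-softmax style loss over a large vocabulary (the shapes are illustrative, not from the release notes):

```python
import mlx.core as mx

# Logits for 8 positions over a 32k-token vocabulary.
logits = mx.random.normal((8, 32_000))
targets = mx.random.randint(0, 32_000, (8,))

# log_softmax(x) = x - logsumexp(x); the fused kernel avoids
# materializing the exponentiated [8, 32k] intermediate.
lse = mx.logsumexp(logits, axis=-1, keepdims=True)
log_probs = logits - lse
loss = -mx.take_along_axis(log_probs, targets[:, None], axis=-1).mean()
mx.eval(loss)
```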
Core
Performance
- Fused vector attention supports head dimension 256
- Tune quantized matrix-vector dispatch for small batches of vectors
Features
- Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
- Enable using MPI on all platforms; only OpenMPI is supported
- Add a ring all-gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports the literal `"causal"` mask (see the sketch after this list)
- Support log for complex numbers
- Distributed `all_min` and `all_max`, for both MPI and the ring backend
- Add `logcumsumexp`
- Add additive mask for fused vector attention
- Improve the usage of the residency set
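
A hedged sketch of the two mask paths in `mx.fast.scaled_dot_product_attention`: the literal `"causal"` string in place of a materialized mask array, and an additive float mask. The `[B, H, L, D]` layout and the 1/sqrt(D) scale below are the usual conventions, stated as assumptions rather than taken from these notes:

```python
import math
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
scale = 1.0 / math.sqrt(D)

# Literal "causal" mask: no [L, L] mask array is allocated.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask="causal")

# Additive mask: float values are added to the attention scores.
bias = mx.zeros((L, L))
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=bias)
```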
NN
- Add sharded layers for model/tensor parallelism
Bugfixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs
v0.24.2
v0.24.1
v0.24.0
Highlights
- Much faster fused attention with support for causal masking (benchmarks)
- Improvements in prompt processing speed and memory use (benchmarks)
- Much faster small-batch fused attention, e.g. for speculative decoding (benchmarks)
- Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
- Support fused masking in `scaled_dot_product_attention`
- Support transposed head/seq for fused vector `scaled_dot_product_attention`
- SDPA support for small batch (over sequence) queries
- Enable fused attention for head dim 128
- Redesign the CPU back-end for faster CPU/GPU synchronization
Features
- Allow debugging in distributed mode
- Support `mx.fast.rms_norm` without scale
- Add nuclear norm support in `mx.linalg.norm` (see the sketch after this list)
- Add XOR on arrays
- Add `mlx::core::version()`
- Allow non-square LU in `mx.linalg.lu`
- Double precision for LAPACK ops (`eigh`, `svd`, etc.)
- Add a script to prepare a Thunderbolt ring
- Docs for the ring backend
- Affine quantization always in fp32
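
A minimal sketch of the nuclear norm (sum of singular values), assuming `mx.linalg.norm` follows the NumPy `ord="nuc"` spelling:

```python
import mlx.core as mx

a = mx.array([[1.0, 2.0], [3.0, 4.0]])
# Nuclear norm needs an SVD; LAPACK-backed routines run on the CPU,
# so the CPU stream is passed explicitly.
n = mx.linalg.norm(a, ord="nuc", stream=mx.cpu)
```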
Optimizers
- Add a multi-optimizer, `optimizers.MultiOptimizer` (see the sketch below)
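
A hedged sketch of `optimizers.MultiOptimizer`: it pairs a list of optimizers with filter predicates over the parameter path, the last optimizer acting as the fallback. The exact predicate signature here is an assumption:

```python
import mlx.optimizers as optim

# Adam for any parameter whose path contains "bias", SGD for the rest.
# Each filter takes the parameter's path (and the parameter itself) and
# returns True if the matching optimizer should own it; the final
# optimizer, which has no filter, catches everything else.
opt = optim.MultiOptimizer(
    [optim.Adam(learning_rate=1e-4), optim.SGD(learning_rate=1e-2)],
    [lambda path, param: "bias" in path],
)
```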
Bug Fixes
- Do not define `MLX_VERSION` globally
- Reduce binary size after the fast-synchronization redesign
- Fix vmap for flatten
- Fix copy for large arrays with JIT
- Fix grad with inplace updates
- Use same accumulation precision in gemv as gemm
- Fix slice data size
- Use a heap for small sizes
- Fix donation in scan
- Ensure linspace always contains start and stop
- Raise an exception in the rope op if input is integer
- Limit compile buffers
- Fix `mx.float64` type promotion
- Fix CPU SIMD `erf_inv`
- Update `smooth_l1_loss` in losses
v0.23.2
v0.23.1
v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
- Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
- Faster Winograd convolutions (benchmarks)
- Up to 3x faster sort (benchmarks)
- Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks)
- Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU (see the sketch after this list)
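
Because double precision is CPU-only, float64 work has to run on the CPU stream; a minimal sketch:

```python
import mlx.core as mx

# float64 is supported only by the CPU back-end, so pass the CPU stream
# explicitly in case the default device is the GPU.
x = mx.array([1.0, 2.0, 3.0], dtype=mx.float64)
y = mx.sum(mx.square(x, stream=mx.cpu), stream=mx.cpu)
print(y.item())  # 14.0
```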
Core
Features
- Bitwise invert, `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, and `mx.linalg.solve_triangular` (see the sketch after this list)
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify accelerate and common back-ends
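
A minimal sketch of the new solvers, assuming the LAPACK-backed routines run on the CPU stream and that `mx.linalg.lu_factor` follows the SciPy-style `(LU, pivots)` return:

```python
import mlx.core as mx

a = mx.array([[3.0, 1.0], [1.0, 2.0]])
b = mx.array([[9.0], [8.0]])

# Solve a @ x == b on the CPU stream.
x = mx.linalg.solve(a, b, stream=mx.cpu)

# LU factorization with pivots, reusable for repeated solves.
lu, pivots = mx.linalg.lu_factor(a, stream=mx.cpu)
```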
Performance
- Faster `Fence` for synchronizing the CPU and GPU
- Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks; see the sketch after this list)
- Fast Winograd convolutions (benchmarks)
- Allow dynamic ops per buffer based on dispatches and memory (benchmarks)
- Up to 3x faster sort (benchmarks)
- Faster small-batch qmv (benchmarks)
- Ring distributed backend
- Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new `Simd<T, N>`
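
A minimal sketch of the accelerated gather/scatter pair mentioned above:

```python
import mlx.core as mx

a = mx.array([[10, 20, 30], [40, 50, 60]])
idx = mx.array([[2], [0]])

# Gather one element per row along the last axis: [[30], [40]].
taken = mx.take_along_axis(a, idx, axis=-1)

# Scatter new values into the same per-row positions (functional:
# returns an updated copy rather than mutating `a`).
updated = mx.put_along_axis(a, idx, mx.array([[0], [1]]), axis=-1)
```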
NN
- Orthogonal initializer, `nn.init.orthogonal` (see the sketch after this list)
- Add dilation for conv 3d layers
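
A hedged sketch of the orthogonal initializer, assuming the usual MLX convention of an initializer returning a function that maps a template array to freshly initialized values:

```python
import mlx.core as mx
import mlx.nn as nn

init_fn = nn.init.orthogonal()
# Fills an array of the given shape with a (semi-)orthogonal matrix,
# so w @ w.T is approximately the identity.
w = init_fn(mx.zeros((128, 128)))
```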
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when `shapeless` changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading an empty list is ok when `strict = false`
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts