Releases · ml-explore/mlx · GitHub

Releases: ml-explore/mlx

v0.25.2

09 May 21:35
659a519

🚀

v0.25.1

24 Apr 23:11
eaf709b

🚀

v0.25.0

17 Apr 23:50
b529515

Highlights

  • Custom logsumexp for reduced memory in training (benchmark)
  • Depthwise separable convolutions
  • Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
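The memory saving in the custom logsumexp comes from the standard numerically stable formulation: shift by the row max so the exponentials never overflow. A minimal NumPy sketch of that identity (not the MLX kernel itself):

```python
import numpy as np

def logsumexp(x, axis=-1):
    # Numerically stable log-sum-exp: shift by the max so exp() never overflows.
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)

x = np.array([1000.0, 1000.0])
print(logsumexp(x))  # ≈ 1000.6931, where the naive log(sum(exp(x))) overflows
```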

Core

Performance

  • Fused vector attention supports 256 dim
  • Tune quantized matrix vector dispatch for small batches of vectors

Features

  • Move the memory API into the top-level mlx.core and enable it for the CPU-only allocator
  • Enable using MPI on all platforms; only Open MPI is supported
  • Add a ring all gather for the ring distributed backend
  • Enable gemm for complex numbers
  • Fused attention supports literal "causal" mask
  • Log for complex numbers
  • Distributed all_min and all_max both for MPI and the ring backend
  • Add logcumsumexp
  • Add additive mask for fused vector attention
  • Improve the usage of the residency set
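The new logcumsumexp is the scan counterpart of logsumexp: out[i] = log(sum(exp(x[:i+1]))), computed stably. A plain-Python/NumPy sketch of the semantics (this is an illustration, not the MLX implementation):

```python
import numpy as np

def logcumsumexp(x):
    # Running log-sum-exp along a 1-D array, carrying the running value
    # through the scan in a numerically stable way via log1p.
    out = np.empty_like(x, dtype=float)
    running = -np.inf
    for i, v in enumerate(x):
        hi, lo = max(running, v), min(running, v)
        running = hi + np.log1p(np.exp(lo - hi))
        out[i] = running
    return out

x = np.array([0.1, -2.0, 3.0])
print(np.allclose(logcumsumexp(x), np.log(np.cumsum(np.exp(x)))))  # True
```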

NN

  • Add sharded layers for model/tensor parallelism

Bugfixes

  • Fix possible allocator deadlock when using multiple streams
  • Ring backend supports 32 bit platforms and FreeBSD
  • Fix FFT bugs
  • Fix attention mask type for fused attention kernel
  • Fix fused attention numerical instability with masking
  • Add a fallback for float16 gemm
  • Fix simd sign for uint64
  • Fix issues in docs

v0.24.2

03 Apr 20:18
86389bf

πŸ› πŸš€

v0.24.1

24 Mar 20:19
aba899c

πŸ›

v0.24.0

20 Mar 22:31
1177d28

Highlights

  • Much faster fused attention with support for causal masking
    • Benchmarks
    • Improvements in prompt processing speed and memory use, benchmarks
    • Much faster small batch fused attention for e.g. speculative decoding, benchmarks
  • Major redesign of CPU back-end for faster CPU-GPU synchronization
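Causal masking in attention means query position i may only attend to key positions j ≤ i. A NumPy sketch of the reference semantics the fused kernel implements (not the kernel itself):

```python
import numpy as np

def causal_attention(q, k, v):
    # Scaled dot-product attention with a lower-triangular (causal) mask:
    # position i may only attend to positions j <= i.
    L = q.shape[0]
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(np.tril(np.ones((L, L), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = causal_attention(q, k, v)
# The first query can only attend to the first key, so out[0] equals v[0].
print(np.allclose(out[0], v[0]))  # True
```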

Core

Performance

  • Support fused masking in scaled_dot_product_attention
  • Support transposed head/seq for fused vector scaled_dot_product_attention
  • SDPA support for small batch (over sequence) queries
  • Enable fused attention for head dim 128
  • Redesign CPU back-end for faster CPU/GPU synchronization

Features

  • Allow debugging in distributed mode
  • Support mx.fast.rms_norm without scale
  • Add nuclear norm support in mx.linalg.norm
  • Add XOR on arrays
  • Added mlx::core::version()
  • Allow non-square lu in mx.linalg.lu
  • Double precision for LAPACK ops (eigh, svd, etc.)
  • Add a prepare tb ring script
  • Ring docs
  • Affine quantization always computed in fp32

Optimizers

  • Add a multi optimizer optimizers.MultiOptimizer
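The idea behind a multi-optimizer is to route each parameter to one of several optimizers based on a filter, e.g. to freeze or down-weight some parameter groups. A hypothetical pure-Python sketch of that routing idea; the names and rule format here are illustrative, not the optimizers.MultiOptimizer API:

```python
# Sketch of the multi-optimizer idea: route each parameter to a plain-SGD
# update whose learning rate is chosen by the first matching rule.
def multi_sgd_step(params, grads, rules):
    # rules: list of (predicate(name), learning_rate); first match wins.
    new = {}
    for name, p in params.items():
        lr = next(lr for pred, lr in rules if pred(name))
        new[name] = p - lr * grads[name]
    return new

params = {"weight": 1.0, "bias": 1.0}
grads = {"weight": 0.5, "bias": 0.5}
rules = [(lambda n: n == "bias", 0.0),  # freeze biases
         (lambda n: True, 0.1)]         # default learning rate
print(multi_sgd_step(params, grads, rules))  # weight stepped, bias unchanged
```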

Bug Fixes

  • Do not define MLX_VERSION globally
  • Reduce binary size after the fast-synchronization redesign
  • Fix vmap for flatten
  • Fix copy for large arrays with JIT
  • Fix grad with inplace updates
  • Use same accumulation precision in gemv as gemm
  • Fix slice data size
  • Use a heap for small sizes
  • Fix donation in scan
  • Ensure linspace always contains start and stop
  • Raise an exception in the rope op if input is integer
  • Limit compile buffers by
  • Fix mx.float64 type promotion
  • Fix CPU SIMD erf_inv
  • Update smooth_l1_loss in losses.

v0.23.2

05 Mar 21:24
f599c11

🚀

v0.23.1

19 Feb 01:53
71de73a

🐞

v0.23.0

14 Feb 21:39
6cec78d

Highlights

  • 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
  • More performance improvements across the board:
    • Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
    • Faster winograd convolutions, benchmarks
    • Up to 3x faster sort, benchmarks
    • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
    • Faster unified CPU back-end with vector operations
  • Double precision (mx.float64) support on the CPU
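mx.take_along_axis and mx.put_along_axis have NumPy counterparts of the same names and semantics: gather (or scatter) elements per row according to an index array of the same rank as the input. A NumPy illustration:

```python
import numpy as np

# take_along_axis gathers one element per row at the given indices;
# put_along_axis scatters a value into those same positions in place.
x = np.array([[10, 20, 30],
              [40, 50, 60]])
idx = np.array([[2], [0]])

taken = np.take_along_axis(x, idx, axis=1)
print(taken)  # gathers 30 from row 0 and 40 from row 1
np.put_along_axis(x, idx, 0, axis=1)
print(x)      # zeroes the last entry of row 0 and the first entry of row 1
```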

Core

Features

  • Bitwise invert mx.bitwise_invert
  • mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, mx.linalg.solve_triangular
  • Support loading F8_E4M3 from safetensors
  • mx.float64 supported on the CPU
  • Matmul JVPs
  • Distributed launch helper mlx.launch
  • Support non-square QR factorization with mx.linalg.qr
  • Support ellipsis in mx.einsum
  • Refactor and unify accelerate and common back-ends
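Ellipsis support in einsum follows the usual convention: "..." stands for any leading batch dimensions, so one subscript string covers batched and unbatched inputs alike. Shown here with NumPy's einsum, which uses the same notation:

```python
import numpy as np

# "...ij,...jk->...ik" is a matmul over the trailing two axes; the ellipsis
# broadcasts over whatever batch dimensions precede them.
a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))
out = np.einsum("...ij,...jk->...ik", a, b)  # batched matmul
print(out.shape)  # (2, 3, 5)
```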

Performance

  • Faster Fence for CPU-GPU synchronization
  • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
  • Fast winograd convolutions, benchmarks
  • Allow dynamic ops per buffer based on dispatches and memory, benchmarks
  • Up to 3x faster sort, benchmarks
  • Faster small batch qmv, benchmarks
  • Ring distributed backend
  • Some CPU ops are much faster with the new Simd<T, N>

NN

  • Orthogonal initializer nn.init.orthogonal
  • Add dilation for conv 3d layers
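One standard construction for an orthogonal initializer is to QR-decompose a Gaussian matrix and keep Q, with signs fixed by the diagonal of R. A NumPy sketch of that recipe (not necessarily the exact nn.init.orthogonal implementation):

```python
import numpy as np

def orthogonal(shape, rng):
    # QR-decompose a random Gaussian matrix; Q is orthogonal, and flipping
    # each column by the sign of R's diagonal makes the result well-behaved.
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

w = orthogonal((4, 4), np.random.default_rng(0))
print(np.allclose(w.T @ w, np.eye(4)))  # True: columns are orthonormal
```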

Bug fixes

  • Limit grad recursion depth by not recursing through non-grad inputs
  • Fix synchronization bug for GPU stream async CPU work
  • Fix shapeless compile on Ubuntu 24
  • Recompile when shapeless changes
  • Fix rope fallback to not upcast
  • Fix metal sort for certain cases
  • Fix a couple of slicing bugs
  • Avoid duplicate malloc with custom kernel init
  • Fix compilation error on Windows
  • Allow Python garbage collector to break cycles on custom objects
  • Fix grad with copies
  • Loading an empty list is ok when strict=False
  • Fix split vmap
  • Fix output donation for IO ops on the GPU
  • Fix creating an array with an int64 scalar
  • Catch stream errors earlier to avoid aborts

v0.22.1

06 Feb 20:10
1a1b210

🚀
