Releases: ml-explore/mlx

v0.26.3

08 Jul 21:26
fb4e8b8

🚀

v0.26.2

01 Jul 22:08
58f3860

🚀

v0.26.0

02 Jun 23:24
0408ba0

Highlights

  • 5-bit quantization
  • Significant progress on the CUDA back-end by @zcbenz

Core

Features

  • 5-bit quantization (see the sketch after this list)
  • Allow per-target Metal debug flags
  • Add complex eigh
  • Add vjp for the mx.all and mx.any reductions
  • Add real and imag properties
  • Non-symmetric mx.linalg.eig and mx.linalg.eigh
  • Add vmap for convolution
  • Add more complex unary ops (sqrt, square, ...)
  • Complex scan
  • Add mx.broadcast_shapes
  • Add an output_padding parameter to conv_transpose
  • Add a random normal distribution for complex numbers
  • Add mx.fft.fftshift and mx.fft.ifftshift helpers
  • Enable vjp for quantized scale and bias
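
The new 5-bit mode plugs into the existing quantize / quantized_matmul / dequantize API. A minimal sketch, assuming bits=5 is accepted everywhere bits=4 already was and that the usual group_size=64 packing applies:

```python
import mlx.core as mx

# Quantize an (out_features, in_features) weight matrix to 5 bits per element.
w = mx.random.normal(shape=(1024, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Matmul against the packed weights: x @ w.T with the default transpose=True layout.
x = mx.random.normal(shape=(2, 512))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=5)

# Round-trip to inspect the quantization error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(y.shape, mx.max(mx.abs(w - w_hat)))
```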

Performance

  • Optimize complex matrix multiplication using Karatsuba’s algorithm (see the sketch after this list)
  • Much faster 1D convolutions
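
The Karatsuba item refers to the classic three-multiplication identity for complex products: Re(AB) = ArBr - AiBi and Im(AB) = (Ar + Ai)(Br + Bi) - ArBr - AiBi, trading the fourth real matmul for a few additions. A rough sketch of the identity in user code (not the fused kernel itself), assuming complex64 inputs, the new real/imag properties, and promotion when multiplying by the Python scalar 1j:

```python
import mlx.core as mx

def complex_matmul_karatsuba(a, b):
    """Complex matmul with three real matmuls instead of four."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    t1 = ar @ br
    t2 = ai @ bi
    t3 = (ar + ai) @ (br + bi)
    # Real part: ArBr - AiBi; imaginary part: t3 - t1 - t2.
    return (t1 - t2) + 1j * (t3 - t1 - t2)
```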

CUDA

  • Generalize the GPU backend
  • Use fallbacks in fast primitives when eval_gpu is not implemented
  • Add a memory cache to the CUDA backend
  • Do not check event.is_signaled() in eval_impl
  • Build for compute capability 70 instead of 75 in the CUDA backend
  • Add the CUDA backend backbone

Bug Fixes

  • Fix out-of-bounds default value in logsumexp/softmax
  • Include mlx::core::version() symbols in the mlx static library
  • Fix nearest upsample
  • Fix large arg reduce
  • Fix conv grad
  • Fix some complex vjps
  • Fix typo in row_reduce_small
  • Fix put_along_axis for empty arrays
  • Close a couple of edge-case bugs: hadamard and addmm on empty inputs
  • Fix integer overflow in fft with large batches
  • Fix conv_general differences between GPU and CPU
  • Fix batched vector sdpa
  • GPU Hadamard for large N
  • Improve bandwidth for elementwise ops
  • Fix compile merging
  • Fix shapeless export to throw on dim mismatch
  • Fix mx.linalg.pinv for singular matrices
  • Fix shift operations
  • Fix integer overflow in qmm

Contributors

Thanks to some awesome contributors!

@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, and @aturker1

v0.25.2

09 May 21:35
659a519

🚀

v0.25.1

24 Apr 23:11
eaf709b

🚀

v0.25.0

17 Apr 23:50
b529515

Highlights

  • Custom logsumexp for reduced memory in training (benchmark)
  • Depthwise separable convolutions
  • Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs

Core

Performance

  • Fused vector attention supports head dim 256
  • Tune quantized matrix-vector dispatch for small batches of vectors

Features

  • Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
  • Enable using MPI from all platforms (Open MPI only)
  • Add a ring all-gather for the ring distributed backend
  • Enable gemm for complex numbers
  • Fused attention supports a literal "causal" mask (see the sketch after this list)
  • Log for complex numbers
  • Distributed all_min and all_max, both for MPI and the ring backend
  • Add logcumsumexp
  • Add an additive mask for fused vector attention
  • Improve the usage of the residency set
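
Two of the items above are easy to exercise directly. A hedged sketch, assuming mx.fast.scaled_dot_product_attention keeps its (q, k, v, scale=..., mask=...) signature and that mx.logcumsumexp takes an axis argument like mx.cumsum:

```python
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal(shape=(B, H, L, D))
k = mx.random.normal(shape=(B, H, L, D))
v = mx.random.normal(shape=(B, H, L, D))

# Literal "causal" mask: no (L, L) boolean or additive mask array is materialized.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D ** -0.5, mask="causal")

# Log-space cumulative sum of exponentials along the last axis.
x = mx.random.normal(shape=(4, 16))
y = mx.logcumsumexp(x, axis=-1)
print(out.shape, y.shape)
```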

NN

  • Add sharded layers for model/tensor parallelism

Bug Fixes

  • Fix a possible allocator deadlock when using multiple streams
  • Ring backend supports 32-bit platforms and FreeBSD
  • Fix FFT bugs
  • Fix attention mask type for the fused attention kernel
  • Fix fused attention numerical instability with masking
  • Add a fallback for float16 gemm
  • Fix simd sign for uint64
  • Fix issues in docs

v0.24.2

03 Apr 20:18
86389bf

πŸ› πŸš€

v0.24.1

24 Mar 20:19
aba899c

πŸ›

v0.24.0

20 Mar 22:31
1177d28

Highlights

  • Much faster fused attention with support for causal masking
    • Benchmarks
    • Improvements in prompt processing speed and memory use, benchmarks
    • Much faster small batch fused attention for e.g. speculative decoding, benchmarks
  • Major redesign of the CPU back-end for faster CPU-GPU synchronization

Core

Performance

  • Support fused masking in scaled_dot_product_attention
  • Support transposed head/seq for fused vector scaled_dot_product_attention
  • SDPA support for small batch (over sequence) queries
  • Enable fused attention for head dim 128
  • Redesign the CPU back-end for faster CPU/GPU synchronization

Features

  • Allow debugging in distributed mode
  • Support mx.fast.rms_norm without scale
  • Add nuclear norm support to mx.linalg.norm
  • Add XOR on arrays
  • Add mlx::core::version()
  • Allow non-square lu in mx.linalg.lu (see the sketch after this list)
  • Double precision for lapack ops (eigh, svd, etc.)
  • Add a script to prepare a Thunderbolt ring
  • Add ring backend docs
  • Always run affine quantization in fp32
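
A few of these are one-liners to try. A hedged sketch, assuming mx.linalg.norm uses NumPy's ord="nuc" spelling for the nuclear norm and that the LAPACK-backed ops still want the CPU stream:

```python
import mlx.core as mx

A = mx.random.normal(shape=(5, 3))

# Non-square LU factorization (previously square-only).
plu = mx.linalg.lu(A, stream=mx.cpu)

# Nuclear norm (sum of singular values), assuming the NumPy-style "nuc" spelling.
n = mx.linalg.norm(A, ord="nuc", stream=mx.cpu)

# Bitwise XOR on integer arrays, via the operator or mx.bitwise_xor.
a = mx.array([0b1010, 0b1100])
b = mx.array([0b0110, 0b0101])
print(a ^ b, mx.bitwise_xor(a, b), n)
```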

Optimizers

  • Add a multi optimizer, optimizers.MultiOptimizer (sketch below)
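
A hedged sketch of how this might be wired up, assuming a constructor of the form MultiOptimizer(optimizers, filters) where each filter is a predicate over the parameter path and the last optimizer catches everything no filter claims:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Assumed API: biases go to SGD, everything else falls through to Adam.
opt = optim.MultiOptimizer(
    [optim.SGD(learning_rate=1e-2), optim.Adam(learning_rate=1e-3)],
    [lambda path, param: "bias" in path],
)

def loss_fn(m, x, y):
    return nn.losses.mse_loss(m(x), y)

x = mx.random.normal(shape=(8, 16))
y = mx.random.normal(shape=(8, 4))
loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)
opt.update(model, grads)
mx.eval(model.parameters(), opt.state)
```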

Bug Fixes

  • Do not define MLX_VERSION globally
  • Reduce binary size after the fast-synchronization redesign
  • Fix vmap for flatten
  • Fix copy for large arrays with the JIT
  • Fix grad with in-place updates
  • Use the same accumulation precision in gemv as in gemm
  • Fix slice data size
  • Use a heap for small sizes
  • Fix donation in scan
  • Ensure linspace always contains start and stop
  • Raise an exception in the rope op if the input is integer
  • Limit compile buffers
  • Fix mx.float64 type promotion
  • Fix CPU SIMD erf_inv
  • Update smooth_l1_loss in losses

v0.23.2

05 Mar 21:24
f599c11

🚀
