Deep learning at the speed of light.
Luminal is a deep learning library that prioritizes static computation and operator fusion to achieve high performance.
```rust
use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>("A");
let b = cx.new_tensor::<R2<1, 4>>("B");

// Do stuff...
let c = a.matmul(b);

// Set inputs and mark outputs
a.set(vec![1.0, 2.0, 3.0]);
b.set(vec![1.0, 2.0, 3.0, 3.0]);
c.mark();

// Optimize and run graph
cx.optimize(GenericOptimizer::default());
cx.execute();

// Get result
println!("Result: {:?}", c.retrieve().unwrap().data);
```
Most deep learning libraries are eager-first, meaning each op call directly operates on the data. So when you see `x + y`, the addition actually happens right there. This is great for debugging: it works exactly as most developers expect.

However, it isn't great for performance, because what makes sense for a developer doesn't make sense for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this by tacking on operator fusion or JIT compilation to change the compilation flow into something better for the machine. It turns out this is super difficult, even for PyTorch!
Luminal takes a different approach, more similar to XLA and tinygrad: everything is static. When you write out an expression like `x + y`, no actual computation happens. The operation is recorded in a directed acyclic computation graph for execution later. Only once `graph.execute()` is run does the computation happen. But isn't that just lazy execution? Yes, it is! But in Luminal everything is done this way. All neural networks are built up as one or a few static computation graphs and executed later.
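To make that concrete, here is a minimal sketch using the same API as the example above. It assumes `+` is overloaded on graph tensors (as the `x + y` example implies); the values are placeholders.

```rust
use luminal::prelude::*;

let mut cx = Graph::new();
let x = cx.new_tensor::<R2<1, 3>>("X");
let y = cx.new_tensor::<R2<1, 3>>("Y");

// This only records an Add node in the graph; no arithmetic happens here.
let z = x + y;

x.set(vec![1.0, 2.0, 3.0]);
y.set(vec![10.0, 20.0, 30.0]);
z.mark();

// All of the actual computation happens on this line.
cx.execute();

println!("{:?}", z.retrieve().unwrap().data); // expect [11.0, 22.0, 33.0]
```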
A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our optimizers have global knowledge and can do much more aggressive optimization without any sync points.
Of course, we can still split the network into multiple separate graphs if we want to insert dynamic control flow partway through, which means this method doesn't preclude optimizations like KV caching: the KV-cached forward pass is just a separate graph!
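Here is a minimal sketch of that idea, stripped of anything attention-specific: two separate graphs, with ordinary Rust control flow deciding what data (e.g. a cached tensor) gets fed from one into the other. Shapes and values are placeholders, only the API calls from the example above are used, and it assumes the retrieved `data` is a `Vec<f32>` that can be fed back in via `set`.

```rust
use luminal::prelude::*;

// Graph 1: a standalone "prefill"-style pass.
let mut prefill = Graph::new();
let prompt = prefill.new_tensor::<R2<2, 3>>("Prompt");
let w = prefill.new_tensor::<R2<3, 3>>("W");
let kv = prompt.matmul(w);
prompt.set(vec![1.0; 6]);
w.set(vec![0.5; 9]);
kv.mark();
prefill.execute();

// Ordinary Rust runs between the graphs (sampling, control flow, caching, ...).
let cached = kv.retrieve().unwrap().data;

// Graph 2: a "decode"-style pass that takes the cached tensor as a plain input.
let mut decode = Graph::new();
let cached_kv = decode.new_tensor::<R2<2, 3>>("CachedKV");
let w2 = decode.new_tensor::<R2<3, 1>>("W2");
let out = cached_kv.matmul(w2);
cached_kv.set(cached);
w2.set(vec![1.0; 3]);
out.mark();
decode.execute();

println!("{:?}", out.retrieve().unwrap().data);
```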
Some huge benefits are now unlocked:
- Aggressive kernel fusion
- Shape-specific kernels compiled at runtime
- Devices and dtypes are handled through optimizers (just run the CUDA optimizer to convert the graph to use CUDA kernels, then the fp16 optimizer to convert to half-precision kernels; see the sketch after this list)
- Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...)
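As a sketch of the device/dtype point: `CudaOptimizer` and `Fp16Optimizer` below are hypothetical placeholder names (not confirmed Luminal types), and applying optimizers one after another is an assumption about the API; the point is only that device and dtype selection look just like the `GenericOptimizer` call in the example above.

```rust
// Hypothetical optimizer names -- placeholders for a backend's device and dtype passes.
cx.optimize(CudaOptimizer::default()); // rewrite primops into CUDA kernel ops
cx.optimize(Fp16Optimizer::default()); // rewrite fp32 kernels into half-precision ones
cx.execute();
```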
Luminal can be run on new accelerators by implementing 11 primitive ops. Take a look at `src/optimizers/cuda/prim.rs` to see 1-to-1 CUDA translations of the primops. Accelerators are free to implement their own custom ops, and their own optimizers to convert Luminal primitive ops to their bespoke ops.
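A very rough, purely illustrative sketch of what such a backend pass conceptually does (the real operator trait and the full primop list live in the Luminal source; the types below are hypothetical and show only a subset):

```rust
// Hypothetical types -- not Luminal's actual API. A backend optimizer conceptually
// walks the graph and swaps each primitive op for its backend-specific kernel op.
enum PrimOp { Add, Mul, Exp2, SumReduce }
enum CudaOp { Add, Mul, Exp2, SumReduce }

fn lower(op: &PrimOp) -> CudaOp {
    match op {
        PrimOp::Add => CudaOp::Add,
        PrimOp::Mul => CudaOp::Mul,
        PrimOp::Exp2 => CudaOp::Exp2,
        PrimOp::SumReduce => CudaOp::SumReduce,
    }
}
```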
All operations are shape checked at compile time, so no more shape mismatches! All credit for this goes to dfdx.
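For example, with the `R2` shapes used above, a dimension mismatch is rejected by rustc rather than surfacing at runtime (the commented-out lines and the error wording are illustrative):

```rust
use luminal::prelude::*;

let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 2>>("A");

// Shapes are part of the type, so mismatched dims are a compile error:
// let bad = cx.new_tensor::<R2<3, 4>>("B");
// let c = a.matmul(bad); // error: inner dimensions 2 and 3 don't match

// This compiles, because (3, 2) x (2, 4) is a valid matmul:
let b = cx.new_tensor::<R2<2, 4>>("B");
let c = a.matmul(b);
```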
Once you've written all your computation code, run `cx.display_graph()` to see the entire computation graph in all its glory. Pretty messy looking! Now run `cx.optimize(GenericOptimizer::default())` and display the graph again. Much better.
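Concretely, with the `cx` from the example at the top of this README:

```rust
// Inspect the raw primop graph first...
cx.display_graph();

// ...then optimize and look again.
cx.optimize(GenericOptimizer::default());
cx.display_graph();
```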
Currently luminal is extremely alpha. Please don't use this in prod.
- Llama 1 is implemented in `examples/llama`. You'll need to follow the instructions in llama-dfdx to download and convert the Llama weights, and point this example's loading path at them.
- The llama example shows how to implement a loader for a custom format. Safetensors loaders are already implemented, and are the recommended way to load a model.
- We have a small library of NN modules in `nn`, including transformers.
- A significant amount of high-level ops are implemented in `hl_ops`. We are aiming to match the tinygrad ops set.
- Currently there are very few optimizers, so primops are mostly used to run these models, which makes them very slow.
- The next release will bring a significant number of optimizers, which should fuse primops into much faster ops. The aim for 0.2 is to be usably fast, not SOTA yet.
Some things on the roadmap:
- Write common-sense CUDA ops and an optimizer (matmuls, mul-add, etc.)
- Build benchmarking suite to test against other libs
- Write specialized CUDA kernels for full transformer architecture (FlashAttention, etc.)
- Automatic differentiation of graphs
- Beat PyTorch 2.0 perf on LLM training
- Build dyson swarm