luminal

Deep learning at the speed of light.

Luminal is a deep learning library that prioritizes static computation and operator fusion to achieve high performance.

use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>("A");
let b = cx.new_tensor::<R2<1, 4>>("B");

// Do stuff...
let c = a.matmul(b);

// Set inputs and mark outputs
a.set(vec![1.0, 2.0, 3.0]);
b.set(vec![1.0, 2.0, 3.0, 3.0]);
c.mark();

// Optimize and run graph
cx.optimize(GenericOptimizer::default());
cx.execute();

// Get result
println!("Result: {:?}", c.retrieve().unwrap().data);

Why does this look so different from other DL libraries?

Most deep learning libraries are eager-first, meaning each op call directly operates on the data. So when you write x + y, the addition actually happens right there. This is great for debugging; it works exactly as most developers expect.

However, this isn't great for performance, because what makes sense for a developer doesn't make sense for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this by tacking on operator fusion or JIT compilation to shift the compilation flow toward something better for the machine. It turns out this is super difficult, even for PyTorch!

Luminal takes a different approach, more similar to XLA and tinygrad. Here everything is static. When you write out an expression like x + y, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later. Only once graph.execute() is run does the computation happen. But isn't that just lazy execution? Yes it is! But in luminal everything is done this way. All neural networks are built up as one or a few static computation graphs and executed later.
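
To make "recorded to a graph" concrete, here is a tiny conceptual sketch (not luminal's actual internals): calling add only appends a node to a DAG, and nothing is computed until the graph is later walked.

// Conceptual sketch only, not luminal's real graph representation.
enum Node {
    Input(&'static str),
    Add(usize, usize), // indices of the two operand nodes
}

struct LazyGraph {
    nodes: Vec<Node>,
}

impl LazyGraph {
    fn input(&mut self, name: &'static str) -> usize {
        self.nodes.push(Node::Input(name));
        self.nodes.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        // Only the op is recorded; no arithmetic runs here.
        self.nodes.push(Node::Add(a, b));
        self.nodes.len() - 1
    }
}

// let mut g = LazyGraph { nodes: vec![] };
// let x = g.input("x");
// let y = g.input("y");
// let z = g.add(x, y); // z is just a node id; nothing has been computed yet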

But Why?

A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our optimizers have global knowledge and can do much more aggressive optimization without any sync points.

Of course, we can still split the network into multiple separate graphs if we want to insert dynamic control flow part-way through, so this approach doesn't preclude optimizations like KV caching: the KV-cached forward pass is just a separate graph!
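
As a rough sketch of what "just a separate graph" means, using only the calls from the example above (shapes and names are illustrative, not a real KV-cache implementation):

// Two independent static graphs; ordinary Rust control flow decides which to run.
let mut prefill = Graph::new();
let prompt = prefill.new_tensor::<R2<1, 4>>("Prompt");
let w = prefill.new_tensor::<R2<4, 4>>("W");
let out = prompt.matmul(w);
out.mark();

let mut decode = Graph::new();
let token = decode.new_tensor::<R2<1, 4>>("Token");
let w_cached = decode.new_tensor::<R2<4, 4>>("W");
let next = token.matmul(w_cached);
next.mark();

// Host code runs prefill.execute() once, then loops on decode.execute(),
// so each pass still gets optimized as its own fully static graph.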

Some huge benefits are now unlocked:

  • Aggressive kernel fusion
  • Shape-specific kernels compiled at runtime
  • Devices and dtypes are handled through optimizers (just run the CUDA optimizer to convert the graph to use CUDA kernels, then the fp16 optimizer to convert to half-precision kernels; see the sketch after this list)
  • Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...)
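
A hedged sketch of how that device/dtype composition might look. The optimizer type names below are placeholders for illustration, not necessarily the exact types shipped in 0.1; see src/optimizers for the real ones.

// Placeholder optimizer names, for illustration only.
cx.optimize(CudaOptimizer::default()); // lower primitive ops to CUDA kernel ops
cx.optimize(Fp16Optimizer::default()); // convert those kernels to half precision
cx.execute();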

RISC-style architecture

Luminal can be run on new accelerators by implementing 11 primitive ops. Take a look at src/optimizers/cuda/prim.rs to see 1-to-1 CUDA translations of the primops.

Accelerators are free to implement their own custom ops, and their own optimizers to convert luminal primitive ops to their bespoke ops.
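
To give a feel for the size of a port, here is an illustrative sketch (this trait is not luminal's actual interface): each primitive op corresponds to one kernel on the new device, and a device-specific optimizer swaps generic graph nodes for these ops.

// Illustrative only; luminal's real op interface lives in the graph/op modules.
trait PrimOps {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
    fn mul(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
    fn sum_reduce(&self, a: &[f32]) -> f32;
    // ...plus the remaining primitive ops (unary math, comparisons, reductions)
}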

Compile-time Shape Checks

All operations are shape checked at compile time, so no more shape mismatches! All credit for this goes to dfdx.
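
For example, building on the tensor types from the snippet above, a mismatched matmul is rejected before the program ever runs (the exact compiler error depends on the shape machinery):

let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>("A");
let b = cx.new_tensor::<R2<2, 4>>("B");
// let c = a.matmul(b); // does not compile: inner dimensions 1 and 2 don't match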

View the Graph

Once you've written all your computation code, run cx.display_graph() to see the entire computation graph in all its glory. Pretty messy looking! Now run cx.optimize(GenericOptimizer::default()) and display the graph again. Much better.
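
Putting that together with the example at the top of this README:

cx.display_graph();                       // raw primitive-op graph: messy
cx.optimize(GenericOptimizer::default()); // fuse and rewrite nodes
cx.display_graph();                       // the optimized graph: much smaller
cx.execute();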

Where are we?

Currently luminal is extremely alpha. Please don't use this in prod.

  • Llama 1 is implemented in examples/llama. You'll need to follow the instructions in llama-dfdx to download and convert the llama weights, and point this example's loading path at them.
  • The llama example shows how to implement a loader for a custom format. Safetensors loaders are already implemented, and are the recommended way to load a model.
  • We have a small library of NN modules in nn, including transformers.
  • A significant number of high-level ops are implemented in hl_ops. We are aiming to match the tinygrad ops set.
  • Currently there are very few optimizers, so these models mostly run on primops, which is very slow.
  • The next release will bring a significant number of optimizers, which should fuse primops into much faster ops. The aim for 0.2 is to be usably fast, not SOTA yet.

Some things on the roadmap:

  • Write common-sense CUDA ops and optimizer (matmuls, mul-add, etc.)
  • Build benchmarking suite to test against other libs
  • Write specialized CUDA kernels for full transformer architecture (FlashAttention, etc.)
  • Automatic differentiation of graphs
  • Beat PT 2.0 perf on LLM training
  • Build dyson swarm

License

Dual-licensed under Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT).