Stars
Context7 MCP Server -- Up-to-date code documentation for LLMs and AI code editors
CUDA Matrix Multiplication Optimization
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
Proper implementation of ResNets for CIFAR10/100 in PyTorch that matches the description in the original paper.
Findpapers: A tool for helping researchers who are looking for related works
FUE5 is a fan-made project with the goal of seeing how Factorio would look and behave in 3D. This project has no affiliation with the official Factorio game.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
A collection of research papers on efficient training of DNNs
PyTorch emulation library for Microscaling (MX)-compatible data formats
Synthesizable IEEE 754 floating-point library in Verilog
Development repository for the Triton language and compiler
Implementation of the Transformer model in TensorFlow
A simple high performance CUDA GEMM implementation.
YOLOv3 implemented in TensorFlow 2.0
tfyolo: Efficient implementation of YOLOv5 in TensorFlow
Transformer in TensorFlow 2.0
A fast and concise implementation of Faster R-CNN with TensorFlow 2.
📝 Source code for matrix multiplication implementations on CUDA
Curated content for DNN approximation, acceleration ... with a focus on hardware accelerators and deployment