This repository is a chat example with LLaMA (arXiv) models running on a typical home PC. You will just need an NVIDIA video card and some RAM to chat with the model.
By using the HF version you may fine-tune the model for any desired task.
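Below is a minimal fine-tuning sketch. It assumes the weights have already been converted to the Hugging Face format and uses the generic `transformers` Trainer API rather than the exact API of the linked HF port; the local path `./llama-7b-hf` and the wikitext dataset are placeholders, so swap in your own data for your task.

```python
# Hedged fine-tuning sketch: generic transformers Trainer, not the exact API of the HF port.
# Assumes LLaMA weights already converted to HF format at ./llama-7b-hf (hypothetical path).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_path = "./llama-7b-hf"                       # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token          # LLaMA tokenizers usually ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_path)

# wikitext-2 is only a stand-in; replace it with the dataset for your own task.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = (dataset.map(tokenize, batched=True, remove_columns=["text"])
                    .filter(lambda row: len(row["input_ids"]) > 0))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./llama-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,                                 # requires a CUDA GPU; drop for CPU-only runs
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```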
This repo is heavily based on Meta's original repo: https://github.com/facebookresearch/llama
And on Steve Manuatu's repo: https://github.com/venuatu/llama
And on Shawn Presser's repo: https://github.com/shawwn/llama
HF 🤗 version by Yam Peleg and Jason Phang: https://github.com/ypeleg/llama & https://github.com/zphang
Share your best prompts, chats, or generations in this issue: randaller#7
## Requirements

- A modern enough CPU
- An NVIDIA graphics card (2 GB of VRAM is enough); the HF version is able to run on CPU, mixed CPU/GPU, or pure GPU (see the sketch after this list)
- 64 GB of RAM, or better 128 GB (192 GB would be perfect for the 65B model)
One may run with 32 GB of RAM, but inference will be slow, limited by how fast your swap file can be read.
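Since the HF version can split the model between the GPU and system RAM, here is a minimal inference sketch of that mixed placement. It uses the generic `transformers` + `accelerate` API rather than the exact API of the linked HF port, and the local path `./llama-7b-hf` is a placeholder for wherever your converted weights live.

```python
# Hedged inference sketch: device_map="auto" (from accelerate) puts as many layers as fit
# on the GPU and keeps the rest in CPU RAM, which gives the mixed CPU/GPU mode.
# Use device_map={"": "cpu"} for pure CPU, or device_map={"": 0} for pure GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-7b-hf"                      # hypothetical local path to converted weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,                    # halves memory use vs. float32
    device_map="auto",                            # requires `pip install accelerate`
)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```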
I am running the PyArrow version on a 12700K / 128 GB RAM / NVIDIA 3070 Ti 8 GB machine, with a fast, large NVMe drive holding a 256 GB swap file for the 65B model, and I get one token from the 30B model every few seconds.
For example, the PyArrow 30B model uses around 70 GB of RAM, the 13B model uses 48 GB, and the 7B model fits into 18 GB.
If you do not have an NVIDIA video card, you may use another repo for CPU-only inference: https://github.com/randaller/llama-cpu, or the HF 🤗 version.