ChatGLM.cpp

C++ implementation of ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B and more LLMs for real-time chatting on your MacBook.

Features

Highlights:

Pure C++ implementation based on ggml, working in the same way as llama.cpp.
Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing.
Streaming generation with typewriter effect.
Python binding, web demo, API servers and more possibilities.
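
The "typewriter effect" in the highlights can be pictured with a minimal sketch. It assumes nothing about the project's real API: `fake_token_stream` is a hypothetical stand-in for a model that yields tokens one at a time, and `typewriter` writes each token as soon as it arrives.

```python
import io
import sys
import time
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one character per 'token'."""
    for ch in text:
        yield ch


def typewriter(tokens: Iterable[str], delay: float = 0.0, out=sys.stdout) -> str:
    """Write each token as soon as it arrives and return the full text."""
    pieces = []
    for tok in tokens:
        out.write(tok)
        out.flush()  # flush so the token becomes visible immediately
        pieces.append(tok)
        if delay:
            time.sleep(delay)  # optional pause to exaggerate the effect
    out.write("\n")
    return "".join(pieces)


# Demo against an in-memory buffer instead of the terminal.
buf = io.StringIO()
text = typewriter(fake_token_stream("Hello!"), out=buf)
```

Passing `out=sys.stdout` and a small `delay` shows the effect in a terminal.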

Support Matrix:

Hardware: x86/arm CPU, NVIDIA GPU, Apple Silicon GPU
Platforms: Linux, macOS, Windows
Models: ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, CodeGeeX2, Baichuan-7B, Baichuan-13B, Baichuan2, InternLM

Getting Started

Preparation

Clone the ChatGLM.cpp repository to your local machine:

git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:

git submodule update --init --recursive

Quantize Model

Install necessary packages for loading and quantizing Hugging Face models:

python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece

Use convert.py to transform ChatGLM-6B into quantized GGML format. For example, to convert the original fp16 model to a q4_0 (quantized int4) GGML model, run:

python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin

The original model (-i <model_name_or_path>) can be a Hugging Face model name or a local path to your pre-downloaded model. Currently supported models are:

ChatGLM-6B: THUDM/chatglm-6b, THUDM/chatglm-6b-int8, THUDM/chatglm-6b-int4
ChatGLM2-6B: THUDM/chatglm2-6b, THUDM/chatglm2-6b-int4
ChatGLM3-6B: THUDM/chatglm3-6b
CodeGeeX2: THUDM/codegeex2-6b, THUDM/codegeex2-6b-int4
Baichuan & Baichuan2: baichuan-inc/Baichuan-13B-Chat, baichuan-inc/Baichuan2-7B-Chat, baichuan-inc/Baichuan2-13B-Chat
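
To convert several of the listed models in one go, the convert.py invocation shown earlier can be generated programmatically. This is a sketch that uses only the `-i`, `-t` and `-o` flags from the example; the output naming scheme is a hypothetical convention, not something the project prescribes.

```python
import shlex
from typing import List


def convert_command(model: str, qtype: str = "q4_0") -> List[str]:
    """Build the argv for convert.py using only the -i, -t and -o flags."""
    # Hypothetical naming scheme: "THUDM/chatglm2-6b" -> "chatglm2-6b-q4_0.bin"
    output = f"{model.split('/')[-1]}-{qtype}.bin"
    return ["python3", "chatglm_cpp/convert.py",
            "-i", model, "-t", qtype, "-o", output]


# Print one shell-ready command per model.
for name in ("THUDM/chatglm-6b", "THUDM/chatglm2-6b", "THUDM/chatglm3-6b"):
    print(shlex.join(convert_command(name)))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the repository root to actually run the conversion.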

You are free to try any of the quantization types below by specifying -t <type>:

q4_0: 4-bit integer quantization with fp16 scales.
q4_1: 4-bit integer quantization with fp16 scales and minimum values.
q5_0: 5-bit integer quantization with fp16 scales.
q5_1: 5-bit integer quantization with fp16 scales and minimum values.
q8_0: 8-bit integer quantization with fp16 scales.
f16: half precision floating point weights without quantization.
f32: single precision floating point weights without quantization.
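
The difference between the "scales" and "scales and minimum values" variants can be illustrated with a simplified sketch. It assumes weights are quantized in blocks of 32 (the ggml convention), stores scales as ordinary Python floats rather than fp16, and does not reproduce ggml's exact rounding or bit packing.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumption: one scale (and optional minimum) per 32 weights


def quantize_symmetric(block: List[float], bits: int) -> Tuple[float, List[int]]:
    """Scale-only quantization, in the spirit of q4_0 / q5_0 / q8_0."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in block]
    return scale, q


def dequantize_symmetric(scale: float, q: List[int]) -> List[float]:
    return [scale * v for v in q]


def quantize_affine(block: List[float], bits: int) -> Tuple[float, float, List[int]]:
    """Scale-and-minimum quantization, in the spirit of q4_1 / q5_1."""
    levels = 2 ** bits - 1                          # 15 for 4 bits
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [max(0, min(levels, round((x - lo) / scale))) for x in block]
    return scale, lo, q


def dequantize_affine(scale: float, lo: float, q: List[int]) -> List[float]:
    return [lo + scale * v for v in q]
```

Storing a minimum lets the integer range cover only the values actually present in the block, which tends to reduce rounding error for blocks that are not centered on zero, at the cost of one extra stored value per block.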

For a LoRA model, add the -l <lora_model_name_or_path> flag to merge your LoRA weights into the base model.
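
For intuition, merging LoRA weights usually means folding the low-rank update into each affected base matrix: W' = W + (alpha / r) * B A. The sketch below shows that standard formula on plain Python lists. Whether convert.py performs the merge exactly this way is not stated here, and the names `merge_lora`, `A`, `B` and `alpha` are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * (B @ A).

    Shapes: W is d_out x d_in, B is d_out x r, A is r x d_in.
    """
    r = len(A)                      # LoRA rank
    scaling = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scaling * delta
    return merged


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r = 1, d_in = 2
B = [[1.0], [3.0]]      # d_out = 2, r = 1
print(merge_lora(W, A, B, alpha=2.0))  # [[3.0, 4.0], [6.0, 13.0]]
```

After merging, the model has the same shape as the base model, so it can be quantized like any other checkpoint.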

Build & Run

Compile the project using CMake:

cmake -B build
cmake --build build -j --config Release

Now you may chat with the quantized ChatGLM-6B model by running:

./build/bin/main -m chatglm-ggml.bin -p 你好
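# 你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。
# (English: the prompt 你好 means "Hello". Reply: "Hello👋! I am the AI assistant ChatGLM-6B. Nice to meet you; feel free to ask me any questions.")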

To run the model in interactive mode, add the -i flag. For example:

./build/bin/main -m chatglm-ggml.bin -i

In interactive mode, your chat history will serve as the context for the next-round conversation.
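
Conceptually, carrying history forward looks like the sketch below. It is not the project's implementation: `generate` is a hypothetical callback standing in for the model, and the history is kept as a flat list of alternating user and assistant messages.

```python
from typing import Callable, List


def chat_round(history: List[str], user_msg: str,
               generate: Callable[[List[str]], str]) -> str:
    """Append the user message, generate a reply from the full history,
    then append the reply so the next round can see it."""
    history.append(user_msg)
    reply = generate(history)
    history.append(reply)
    return reply


def counting_model(history: List[str]) -> str:
    """Hypothetical stand-in model: reports how much context it received."""
    return f"seen {len(history)} message(s)"


history: List[str] = []
first = chat_round(history, "hi", counting_model)           # 1 message of context
second = chat_round(history, "and again", counting_model)   # 3 messages of context
```

The second round sees three messages because the first question and its answer are both part of the context.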

Run ./build/bin/main -h to explore more options!

Try Other Models

ChatGLM2-6B

python3 chatglm_cpp/convert.py -i THUDM/chatglm2-6b -t q4_0 -o chatglm2-ggml.bin
./build/bin/main -m chatglm2-ggml.bin -p 你好 --top_p 0.8 --temp 0.8
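# 你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。
# (English: "Hello👋! I am the AI assistant ChatGLM2-6B. Nice to meet you; feel free to ask me any questions.")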
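
The `--top_p` and `--temp` flags presumably select nucleus (top-p) sampling with a temperature. The sketch below shows the usual definitions of both in plain Python. It is an illustration of the technique, not the project's C++ sampling code, and all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional


def softmax_with_temperature(logits: List[float], temp: float) -> List[float]:
    """Lower temperatures sharpen the distribution; higher ones flatten it."""
    scaled = [l / temp for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample(logits: List[float], top_p: float = 0.8, temp: float = 0.8,
           rng: Optional[random.Random] = None) -> int:
    """Draw one token id from the temperature-scaled, top-p-filtered distribution."""
    rng = rng or random.Random()
    dist = top_p_filter(softmax_with_temperature(logits, temp), top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With `top_p` close to 1 almost every token stays eligible; smaller values restrict sampling to the most likely candidates.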

ChatGLM3-6B

In addition to chat mode, ChatGLM3-6B also supports function calling and a code interpreter.

Chat mode:

python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o chatglm3-ggml.bin
./build/bin/main -m chatglm3-ggml.bin -p 你好 --top_p 0.8 --temp 0.8
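# 你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。
# (English: "Hello👋! I am the AI assistant ChatGLM3-6B. Nice to meet you; feel free to ask me any questions.")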

Setting the system prompt:

./build/bin/main -m chatglm3-ggml.bin -p 你好 -s "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."
# 你好👋！我是 ChatGLM3，有什么问题可以帮您解答吗？
# (English: "Hello👋! I am ChatGLM3. What questions can I help you with?")

Function call:

$ ./build/bin/main -m chatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples/system/function_call.txt -i
System   > Answer the following questions as best as you can. You have access to the following tools: ...
Prompt   > 生成一个随机数
           (English: Generate a random number)
ChatGLM3 > random_number_generator
```python
tool_call(seed=42, range=(0, 100))
```
Tool Call   > Please manually call function `random_number_generator` with args `tool_call(seed=42, range=(0, 100))` and provide the results below.
Observation > 23
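ChatGLM3 > 根据您的要求，我使用随机数生成器API生成了一个随机数。根据API返回结果，生成的随机数为23。
           (English: As you requested, I used the random number generator API to generate a random number. According to the API's result, the generated random number is 23.)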
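
In the session above, the user is asked to call `random_number_generator` by hand and type the result as the observation. One way to do that is sketched below. The real tool specification lives in examples/system/function_call.txt and is elided in the transcript, so this implementation, including the inclusive bounds, is an assumption; it will not necessarily return the 23 shown in the transcript.

```python
import random
from typing import Tuple


def random_number_generator(seed: int, range: Tuple[int, int]) -> int:
    """Hypothetical implementation of the tool named in the transcript.

    The parameter is called `range` to match the model's call
    `tool_call(seed=42, range=(0, 100))`; it shadows the builtin only
    inside this function.
    """
    low, high = range
    # A dedicated Random instance keeps the result reproducible per seed.
    return random.Random(seed).randint(low, high)


observation = random_number_generator(seed=42, range=(0, 100))
print(observation)  # paste this value after "Observation >"
```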

Code interpreter:

$ ./build/bin/main -m chatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples/system/code_interpreter.txt -i
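System   > 你是一位智能AI助手，你叫ChatGLM，你连接着一台电脑，但请注意不能联网。在使用Python解决任务时，你可以运行代码并得到结果，如果运行结果有错误，你需要尽可能对代码进行改进。你可以处理用户上传到电脑上的文件，文件默认存储路径是/mnt/data/。
           (English: You are an intelligent AI assistant named ChatGLM. You are connected to a computer, but note that it cannot access the internet. When using Python to solve tasks, you can run code and get the results; if the results contain errors, you need to improve the code as much as possible. You can process files that the user uploads to the computer; the default file storage path is /mnt/data/.)
Prompt   > 列出100以内的所有质数
           (English: List all prime numbers up to 100)
ChatGLM3 > 好的，我会为您列出100以内的所有质数。
           (English: Sure, I will list all prime numbers up to 100 for you.)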
```python
def is_prime(n):
   """Check if a number is prime."""
   if n <= 1:
       return False
   if n <= 3:
       return True
   if n % 2 == 0 or n % 3 == 0:
       return False
   i = 5
   while i * i <= n:
       if n % i == 0 or n % (i + 2) == 0:
           return False
       i += 6
   return True

primes_upto_100 = [i for i in range(2, 101) if is_prime(i)]
primes_upto_100
```

Code Interpreter > Please manually run the code and provide the results below.
Observation      > [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
ChatGLM3 > 100以内的所有质数为：
           (English: All prime numbers up to 100 are:)

$$
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 
$$
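
The "Please manually run the code" step can be done in any Python session. For convenience, the sketch below runs a snippet and returns the value of its final expression, the way a notebook cell would. `run_snippet` is a hypothetical helper, not part of ChatGLM.cpp.

```python
import ast
from typing import Any


def run_snippet(source: str) -> Any:
    """Execute `source` and return the value of its last expression, if any."""
    tree = ast.parse(source)
    namespace: dict = {}
    last = tree.body[-1] if tree.body else None
    if isinstance(last, ast.Expr):
        # Run everything except the trailing expression, then evaluate it.
        head = ast.Module(body=tree.body[:-1], type_ignores=[])
        exec(compile(head, "<snippet>", "exec"), namespace)
        tail = ast.Expression(body=last.value)
        return eval(compile(tail, "<snippet>", "eval"), namespace)
    exec(compile(tree, "<snippet>", "exec"), namespace)
    return None


result = run_snippet("squares = [i * i for i in range(5)]\nsquares")
print(result)  # [0, 1, 4, 9, 16]
```

Because this executes model-generated code with full interpreter privileges, it is best run in an isolated environment such as a container.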

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	74	77	86	89	114	189
ms/token (CUDA @ V100 SXM2)	8.1	8.7	9.4	9.5	12.0	19.1
ms/token (MPS @ M2 Ultra)	11.5	12.3	N/A	N/A	16.1	24.4
file size	3.3G	3.7G	4.0G	4.4G	6.2G	12G
mem usage	4.0G	4.4G	4.7G	5.1G	6.9G	13G
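
The file sizes in the table above are roughly what one would expect from the storage cost per weight. The sketch below estimates bits per weight assuming blocks of 32 weights (the ggml convention, not stated in this document) with one fp16 scale per block, plus one fp16 minimum for the `_1` variants, and scales the F16 file size accordingly.

```python
BLOCK_SIZE = 32   # assumption: weights per quantization block
FP16_BITS = 16

# name -> (bits per quantized weight, fp16 side values stored per block)
FORMATS = {
    "q4_0": (4, 1), "q4_1": (4, 2),
    "q5_0": (5, 1), "q5_1": (5, 2),
    "q8_0": (8, 1),
}


def bits_per_weight(name: str) -> float:
    """Average storage cost per weight, including per-block metadata."""
    if name == "f16":
        return 16.0
    bits, extras = FORMATS[name]
    return bits + extras * FP16_BITS / BLOCK_SIZE


def estimated_size(name: str, f16_size: float) -> float:
    """Scale the F16 file size by the ratio of bits per weight."""
    return f16_size * bits_per_weight(name) / 16.0


for name in ("q4_0", "q4_1", "q5_0", "q5_1", "q8_0"):
    print(name, bits_per_weight(name), round(estimated_size(name, 12.0), 2))
```

Starting from the 12G F16 file, this yields about 3.4G, 3.8G, 4.1G, 4.5G and 6.4G, close to the reported 3.3G to 6.2G. The small gap is plausible given that the reported sizes are rounded.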

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	64	71	79	83	106	189
ms/token (CUDA @ V100 SXM2)	7.9	8.3	9.2	9.2	11.7	18.5
ms/token (MPS @ M2 Ultra)	10.0	10.8	N/A	N/A	14.5	22.2
file size	3.3G	3.7G	4.0G	4.4G	6.2G	12G
mem usage	3.4G	3.8G	4.1G	4.5G	6.2G	12G

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	85.3	94.8	103.4	109.6	136.8	248.5
ms/token (CUDA @ V100 SXM2)	8.7	9.2	10.2	10.3	13.2	21.0
ms/token (MPS @ M2 Ultra)	11.3	12.0	N/A	N/A	16.4	25.6
file size	4.0G	4.4G	4.9G	5.3G	7.5G	14G
mem usage	4.5G	4.9G	5.3G	5.7G	7.8G	14G

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	161.7	175.8	189.9	192.3	255.6	459.6
ms/token (CUDA @ V100 SXM2)	13.7	15.1	16.3	16.9	21.9	36.8
ms/token (MPS @ M2 Ultra)	18.2	18.8	N/A	N/A	27.2	44.4
file size	7.0G	7.8G	8.5G	9.3G	14G	25G
mem usage	7.8G	8.8G	9.5G	10G	14G	25G
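
Latency in ms/token converts directly to throughput. As a worked example using the Q4_0 column of the table above (161.7 ms/token on CPU, 13.7 ms/token on CUDA):

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


cpu_ms, cuda_ms = 161.7, 13.7   # Q4_0 values from the table above

print(round(tokens_per_second(cpu_ms), 1))    # 6.2
print(round(tokens_per_second(cuda_ms), 1))   # 73.0
print(round(cpu_ms / cuda_ms, 1))             # 11.8 (CUDA speedup over CPU)
```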

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	85.3	90.1	103.5	112.5	137.3	232.2
ms/token (CUDA @ V100 SXM2)	9.1	9.4	10.5	10.5	13.3	21.1

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16
ms/token (CPU @ Platinum 8260)	230.0	236.7	276.6	290.6	357.1	N/A
ms/token (CUDA @ V100 SXM2)	21.6	23.2	25.0	25.9	33.4	N/A
