
Cortex


Documentation - API Reference - Changelog - Bug reports - Discord

⚠️ Cortex is currently in Development: Expect breaking changes and bugs!

About

Cortex is a C++ AI engine that comes with a Docker-like command-line interface and client libraries. It supports running AI models using ONNX, TensorRT-LLM, and llama.cpp engines. Cortex can function as a standalone server or be integrated as a library.

Cortex Engines

Cortex supports the following engines:

  • cortex.llamacpp: a C++ inference library that can be dynamically loaded by any server at runtime. Cortex uses this engine for inference with GGUF models; llama.cpp is optimized for performance on both CPU and GPU.
  • cortex.onnx: a C++ inference library for Windows that leverages onnxruntime-genai and uses DirectML to provide GPU acceleration across a wide range of hardware and drivers, including AMD, Intel, NVIDIA, and Qualcomm GPUs.
  • cortex.tensorrt-llm: a C++ inference library designed for NVIDIA GPUs. It incorporates NVIDIA's TensorRT-LLM for GPU-accelerated inference.

Installation

macOS

brew install cortex-engine

Windows

winget install cortex-engine

Linux

sudo apt install cortex-engine

Docker

Coming Soon!

Libraries

Build from Source

To install Cortex from source, follow the steps below:

  1. Clone the Cortex repository.
  2. Navigate to the cortex-js folder.
  3. Open a terminal and run the following command to build the Cortex project:
npx nest build
  4. Make command.js executable:
chmod +x '[path-to]/cortex/cortex-js/dist/src/command.js'
  5. Link the package globally:
npm link

Quickstart

To run and chat with a model in Cortex:

# Start the Cortex server
cortex

# Start a model
cortex run [model_id]

# Chat with a model
cortex chat [model_id]

Model Library

Cortex supports a list of models available on Cortex Hub.

Here are examples of models you can use with each supported engine:

llama.cpp

| Model ID | Variant (Branch) | Model size | CLI command |
|---|---|---|---|
| codestral | 22b-gguf | 22B | cortex run codestral:22b-gguf |
| command-r | 35b-gguf | 35B | cortex run command-r:35b-gguf |
| gemma | 7b-gguf | 7B | cortex run gemma:7b-gguf |
| llama3 | gguf | 8B | cortex run llama3:gguf |
| llama3.1 | gguf | 8B | cortex run llama3.1:gguf |
| mistral | 7b-gguf | 7B | cortex run mistral:7b-gguf |
| mixtral | 7x8b-gguf | 46.7B | cortex run mixtral:7x8b-gguf |
| openhermes-2.5 | 7b-gguf | 7B | cortex run openhermes-2.5:7b-gguf |
| phi3 | medium-gguf | 14B - 4k ctx len | cortex run phi3:medium-gguf |
| phi3 | mini-gguf | 3.82B - 4k ctx len | cortex run phi3:mini-gguf |
| qwen2 | 7b-gguf | 7B | cortex run qwen2:7b-gguf |
| tinyllama | 1b-gguf | 1.1B | cortex run tinyllama:1b-gguf |

ONNX

| Model ID | Variant (Branch) | Model size | CLI command |
|---|---|---|---|
| gemma | 7b-onnx | 7B | cortex run gemma:7b-onnx |
| llama3 | onnx | 8B | cortex run llama3:onnx |
| mistral | 7b-onnx | 7B | cortex run mistral:7b-onnx |
| openhermes-2.5 | 7b-onnx | 7B | cortex run openhermes-2.5:7b-onnx |
| phi3 | mini-onnx | 3.82B - 4k ctx len | cortex run phi3:mini-onnx |
| phi3 | medium-onnx | 14B - 4k ctx len | cortex run phi3:medium-onnx |

TensorRT-LLM

| Model ID | Variant (Branch) | Model size | CLI command |
|---|---|---|---|
| llama3 | 8b-tensorrt-llm-windows-ampere | 8B | cortex run llama3:8b-tensorrt-llm-windows-ampere |
| llama3 | 8b-tensorrt-llm-linux-ampere | 8B | cortex run llama3:8b-tensorrt-llm-linux-ampere |
| llama3 | 8b-tensorrt-llm-linux-ada | 8B | cortex run llama3:8b-tensorrt-llm-linux-ada |
| llama3 | 8b-tensorrt-llm-windows-ada | 8B | cortex run llama3:8b-tensorrt-llm-windows-ada |
| mistral | 7b-tensorrt-llm-linux-ampere | 7B | cortex run mistral:7b-tensorrt-llm-linux-ampere |
| mistral | 7b-tensorrt-llm-windows-ampere | 7B | cortex run mistral:7b-tensorrt-llm-windows-ampere |
| mistral | 7b-tensorrt-llm-linux-ada | 7B | cortex run mistral:7b-tensorrt-llm-linux-ada |
| mistral | 7b-tensorrt-llm-windows-ada | 7B | cortex run mistral:7b-tensorrt-llm-windows-ada |
| openhermes-2.5 | 7b-tensorrt-llm-windows-ampere | 7B | cortex run openhermes-2.5:7b-tensorrt-llm-windows-ampere |
| openhermes-2.5 | 7b-tensorrt-llm-windows-ada | 7B | cortex run openhermes-2.5:7b-tensorrt-llm-windows-ada |
| openhermes-2.5 | 7b-tensorrt-llm-linux-ada | 7B | cortex run openhermes-2.5:7b-tensorrt-llm-linux-ada |

Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 14B models, and 32 GB to run the 32B models.

Cortex CLI Commands

Note: For a more detailed CLI Reference documentation, please see here.

Start Cortex Server

cortex 

Chat with a Model

cortex chat [options] [model_id] [message]

Embeddings

cortex embeddings [options] [model_id] [message]

Pull a Model

cortex pull <model_id>

This command can also pull models from Hugging Face.

Download and Start a Model

cortex run [options] [model_id]:[engine]

Get Model Details

cortex models get <model_id>

List Models

cortex models list [options]

Remove a Model

cortex models remove <model_id>

Start a Model

cortex models start [model_id]

Stop a Model

cortex models stop <model_id>

Update a Model Config

cortex models update [options] <model_id>

Get Engine Details

cortex engines get <engine_name>

Install an Engine

cortex engines install <engine_name> [options]

List Engines

cortex engines list [options]

Set an Engine Config

cortex engines set <engine_name> <config> <value>

Show Model Information

cortex ps

REST API

Cortex has a REST API that runs at localhost:1337.

Pull a Model

curl --request POST \
  --url http://localhost:1337/v1/models/{model_id}/pull

Start a Model

curl --request POST \
  --url http://localhost:1337/v1/models/{model_id}/start \
  --header 'Content-Type: application/json' \
  --data '{
  "prompt_template": "system\n{system_message}\nuser\n{prompt}\nassistant",
  "stop": [],
  "ngl": 4096,
  "ctx_len": 4096,
  "cpu_threads": 10,
  "n_batch": 2048,
  "caching_enabled": true,
  "grp_attn_n": 1,
  "grp_attn_w": 512,
  "mlock": false,
  "flash_attn": true,
  "cache_type": "f16",
  "use_mmap": true,
  "engine": "cortex.llamacpp"
}'
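
The start options above are forwarded to the selected engine. Below is an annotated sketch of the same payload in TypeScript; the comments follow common llama.cpp server conventions and are best-effort assumptions, not official Cortex documentation.

// Annotated copy of the start payload above; comments are assumptions based on
// common llama.cpp server options, not official Cortex documentation.
const startModelBody = {
  prompt_template: "system\n{system_message}\nuser\n{prompt}\nassistant", // chat template applied to requests
  stop: [],                  // additional stop sequences
  ngl: 4096,                 // number of model layers to offload to the GPU
  ctx_len: 4096,             // context window size in tokens
  cpu_threads: 10,           // CPU threads used for inference
  n_batch: 2048,             // prompt-processing batch size
  caching_enabled: true,     // reuse the prompt/KV cache between requests
  grp_attn_n: 1,             // group-attention factor (self-extend)
  grp_attn_w: 512,           // group-attention width (self-extend)
  mlock: false,              // lock the model weights in RAM
  flash_attn: true,          // use flash attention when available
  cache_type: "f16",         // KV cache precision
  use_mmap: true,            // memory-map the model file instead of loading it fully
  engine: "cortex.llamacpp", // Cortex engine that should serve the model
};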

Chat with a Model

curl http://localhost:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "mistral",
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "stream": true,
  "max_tokens": 128,
  "stop": [],
  "frequency_penalty": 1,
  "presence_penalty": 1,
  "temperature": 1,
  "top_p": 1
}'
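
Because this endpoint follows the OpenAI Chat Completions format, existing OpenAI client libraries can usually be pointed at the local server instead. The TypeScript sketch below uses the official openai npm package; the base URL and model name mirror the curl example above, and the placeholder API key assumes the local server does not validate keys.

import OpenAI from "openai";

// Point the standard OpenAI client at the local Cortex server.
// The API key is a placeholder; a local server is assumed not to check it.
const client = new OpenAI({
  baseURL: "http://localhost:1337/v1",
  apiKey: "not-needed",
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "mistral",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(completion.choices[0].message.content);
}

main();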

Stop a Model

curl --request POST \
  --url http://localhost:1337/v1/models/mistral/stop
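
Putting the endpoints above together, a model can also be managed programmatically. The TypeScript sketch below (Node 18+ for the built-in fetch) uses only the routes shown in this section; it assumes each call finishes before the next one is issued (pulling a large model can take a while) and that starting a model with just the engine field is sufficient, so treat it as an outline rather than a complete client.

// Lifecycle sketch: pull a model, start it, send one chat request, stop it.
// Only the endpoints documented above are used; error handling is omitted.
const BASE = "http://localhost:1337/v1";

async function runOnce(modelId: string) {
  // Pull (download) the model.
  await fetch(`${BASE}/models/${modelId}/pull`, { method: "POST" });

  // Start the model; a fuller payload is shown in the "Start a Model" example.
  await fetch(`${BASE}/models/${modelId}/start`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ engine: "cortex.llamacpp" }),
  });

  // Send a single, non-streaming chat request.
  const res = await fetch(`${BASE}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: modelId,
      messages: [{ role: "user", content: "Hello" }],
      stream: false,
    }),
  });
  console.log(await res.json());

  // Stop the model when done.
  await fetch(`${BASE}/models/${modelId}/stop`, { method: "POST" });
}

runOnce("mistral");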

Note: Check our API documentation for a full list of available endpoints.

Contact Support
