Matrix is a versatile toolkit for synthetic data generation. It is the inference engine of Collaborative Reasoner for multi-agent conversation generation.
Matrix runs on top of a Ray cluster. Cluster resources are acquired from Slurm or the local machine through submitit. The main features are:
- Run large-scale inference for Hugging Face LLMs using vLLM and SGLang.
- Proxy server to support Azure OpenAI, SageMaker, and Gemini models.
- Code execution service as a wrapper around bubblewrap.
- Data pipelines for data quality filtering and classification.
Matrix is designed for scalable LLM inference on Slurm. Here is a feature comparison with other popular LLM inference solutions.
| Serving Frameworks | Slurm | vLLM | HTTP | gRPC | Auto-scaling | Open-source |
|---|---|---|---|---|---|---|
| vector-inference | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| litellm | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| ollama | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| SageMaker | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ |
| llm-swarm | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Matrix | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
- Conda Environment

```bash
conda create --name matrix python=3.10
conda activate matrix
pip install 'git+ssh://git@github.com/facebookresearch/matrix.git#egg=matrix[vllm_083]'
```
- Launch Ray cluster

```bash
matrix start_cluster --add_workers 1 --slurm "{'account': $SLURM_ACCOUNT, 'qos': $SLURM_QOS}"
```
- Deploy Model

```bash
matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-3.1-8B-Instruct', 'min_replica': 8}]"
```
- LLM Inference

```bash
# first verify the deployed app is healthy
matrix check_health --app_name 8B
```
- Shutdown Ray cluster

```bash
matrix stop_cluster
```
- Install Prometheus and Grafana in conda

```bash
bash ./matrix/scripts/install_prometheus_and_grafana.sh
```
- Enable Grafana in the Ray Dashboard

```bash
matrix start_cluster --enable_grafana
```
- Add More Workers

```bash
matrix start_cluster --add_workers 4 --slurm "{'account': $SLURM_ACCOUNT, 'qos': $SLURM_QOS}"
```
- Add/Remove Applications

```bash
matrix deploy_applications --action add --applications "[{'model_name': 'meta-llama/Llama-3.1-405B-Instruct', 'min_replica': 2}]"
```
- Remove All Applications

```bash
matrix deploy_applications --applications ''
```
vLLM engine arguments can be specified in the deploy_applications arguments. The default values for popular models are defined in llm_config.py. Other useful args (a sketch combining them follows the list):
- `model_name`: a Hugging Face model name or a directory containing checkpoints.
- `name`: the default app_name.
- `model_size`: map a non-Hugging Face model to the defaults in the config file.
- `max_ongoing_requests`: the max concurrent requests to each replica.
- `min_replica` and `max_replica`: the range of replica counts, auto-scaled based on the number of Ray workers.
- `use_grpc`: enable gRPC by adding `{'use_grpc': 'true'}`.
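The spec passed to `--applications` is a Python-style list of dicts, so one way to keep long specs readable is to build them in Python and quote them for the shell. A minimal sketch using the arguments above; the values are illustrative, not defaults:

```python
# Compose a deploy_applications spec from the documented arguments.
import shlex

app_spec = [{
    "model_name": "meta-llama/Llama-3.1-8B-Instruct",  # HF name or checkpoint dir
    "name": "8B",                 # override the default app_name
    "min_replica": 4,             # auto-scaling lower bound
    "max_replica": 16,            # auto-scaling upper bound
    "max_ongoing_requests": 128,  # per-replica concurrency cap
    "use_grpc": "true",           # also expose the gRPC endpoint
}]

# Print a ready-to-run CLI command.
print("matrix deploy_applications --applications " + shlex.quote(str(app_spec)))
```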
- Note: no GPU is required; when starting workers you can add `--slurm "{'gpus_per_node': 0}"`.

```bash
matrix deploy_applications --applications "[{'api_version': \"$AZURE_API_VERSION\", 'api_endpoint': \"$AZURE_ENDPOINT\", 'api_key': \"$AZURE_API_KEY\", 'app_type': 'openai', 'model_name': 'gpt-4o', 'name': 'openai'}]"
```
- Note: no GPU is required; when starting workers you can add `--slurm "{'gpus_per_node': 0}"`.

```bash
matrix deploy_applications --applications "[{'app_type': 'gemini', 'name': 'gemini', 'api_key': \"$GOOGLE_API_KEY\", 'model_name': 'gemini-2.0-flash'}]"
```
vLLM >= 0.8.3 supports DeepSeek-R1. An alternative backend is SGLang.
```bash
# install sglang
pip install 'git+ssh://git@github.com/facebookresearch/matrix.git#egg=matrix[sglang_045]'
matrix deploy_applications --applications "[{'model_name': 'deepseek-ai/DeepSeek-R1', 'pipeline-parallel-size': 2, 'app_type': 'sglang_llm'}]"
```
```bash
pip install 'git+ssh://git@github.com/facebookresearch/matrix.git#egg=matrix[vllm_083]'
matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-4-Scout-17B-16E-Instruct'}]"
matrix deploy_applications --applications "[{'model_name': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8'}]"
```
```bash
# download the MATH-500 dataset
python -m matrix.scripts.hf_dataset_to_jsonl HuggingFaceH4/MATH-500 test test.jsonl
# query MATH-500
matrix inference --app_name maverick-fp8 --input_jsonls test.jsonl --output_jsonl response.jsonl --batch_size=64 --system_prompt "Please reason step by step, and put your final answer within \boxed{}." --max_tokens 30000 --text_key problem --timeout_secs 1800
```
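The output is also jsonl, one JSON object per line; the exact field names depend on matrix's output schema, so a schema-agnostic way to peek at it is:

```python
# Inspect the first record of the output without assuming its field names.
import json

with open("response.jsonl") as f:
    first = json.loads(f.readline())
print(sorted(first.keys()))
```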
There are three formats for the jsonl input files (a writer sketch follows the list):
- Message format with arg `--messages_key request.messages`

```json
{
  "request": {"messages": [{"role": "system", "content": "You are ..."}, {"role": "user", "content": "Solve the following..."}]}
}
```
- Instruct format with arg `--text_key text`

```json
{
  "text": "<|start_header_id|>system<|end_header_id|>You are ... <|eot_id|><|start_header_id|>user<|end_header_id|>Solve the following ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
}
```
- Raw text format with arg `--text_key text`

```json
{
  "text": "Solve the following ..."
}
```
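For example, a minimal sketch that writes a message-format input file (the prompts are placeholders):

```python
# Write one JSON object per line in the message format described above.
import json

rows = [
    {"request": {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Solve the following equation: x + 1 = 3."},
    ]}},
]
with open("test.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```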
```python
from matrix import Cli
from matrix.client import query_llm

metadata = Cli().get_app_metadata(app_name="8B")

# async call
await query_llm.make_request(
    url=metadata["endpoints"]["head"],
    model=metadata["model_name"],
    app_name=metadata["name"],
    data={"messages": [{"role": "user", "content": "hi"}]},
)

# batch inference
query_llm.batch_requests(
    url=metadata["endpoints"]["head"],
    model=metadata["model_name"],
    app_name=metadata["name"],
    requests=[{"messages": [{"role": "user", "content": "hi"}]}],
)
```
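`make_request` is a coroutine, so the `await` form above assumes an async context such as a notebook; a standalone script needs an event loop. A minimal sketch, reusing the `8B` deployment from the quick start:

```python
import asyncio

from matrix import Cli
from matrix.client import query_llm


async def main():
    metadata = Cli().get_app_metadata(app_name="8B")
    response = await query_llm.make_request(
        url=metadata["endpoints"]["head"],
        model=metadata["model_name"],
        app_name=metadata["name"],
        data={"messages": [{"role": "user", "content": "hi"}]},
    )
    print(response)  # response schema depends on the deployed app


asyncio.run(main())
```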
- Install bubblewrap

```bash
conda install -c conda-forge bubblewrap
```
- Run example Python code (a sandboxing sketch follows these commands)

```bash
matrix deploy_applications --applications "[{'name': 'code', 'app_type': 'code', 'min_replica': 5}]"
matrix check_health --app_name code
python -m matrix.scripts.hf_dataset_to_jsonl openai/openai_humaneval test humaneval/test.jsonl
matrix inference code ~/tmp/he.jsonl humaneval/test.jsonl --text_keys "[prompt, canonical_solution, test, entry_point]" --prompt_template "check({entry_point})"
```
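Under the hood the service shells out to bubblewrap; the sketch below shows the kind of sandboxed call such a wrapper makes. The exact flags matrix uses are not shown here, so treat them as illustrative:

```python
# Run untrusted code under bubblewrap: read-only root filesystem, fresh
# namespaces (no network), and the child dies with the parent process.
import subprocess

result = subprocess.run(
    [
        "bwrap",
        "--ro-bind", "/", "/",   # mount the host filesystem read-only
        "--unshare-all",         # new PID/network/IPC/... namespaces
        "--die-with-parent",
        "python3", "-c", "print(2 + 2)",
    ],
    capture_output=True, text=True, timeout=10,
)
print(result.stdout.strip())  # -> 4
```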
- MinHash dedup (a conceptual sketch follows)

```bash
python -m matrix.data_pipeline.quality.dedup_minhash $ray_head:$client_server_port input.jsonl output_dir working_dir --text_key problem
```
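As a refresher on the technique itself rather than matrix's implementation: MinHash signatures approximate Jaccard similarity between shingle sets, and near-duplicate pairs are those whose signatures mostly agree:

```python
# Toy MinHash: estimate Jaccard similarity of character-shingle sets.
import hashlib


def minhash(text: str, num_perm: int = 64, ngram: int = 5) -> list[int]:
    shingles = {text[i : i + ngram] for i in range(max(1, len(text) - ngram + 1))}
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        for seed in range(num_perm)
    ]


def similarity(a: list[int], b: list[int]) -> float:
    # Fraction of matching signature positions approximates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(similarity(minhash("the quick brown fox"), minhash("the quick brown foxes")))
```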
- Multilabel classification (a single-example sketch follows)

```bash
python -m matrix.data_pipeline.classification.multi_label_classification $ray_head:$client_server_port cardiffnlp/twitter-roberta-base-emotion-multilabel-latest input.jsonl output_dir --num_gpus 8 --text_key question --threshold_fname ""
```
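Conceptually, multi-label classification applies a sigmoid to each logit and keeps the labels above a threshold (matrix parallelizes this over Ray; `--threshold_fname` presumably supplies per-label cutoffs). A single-example sketch with Hugging Face transformers:

```python
# Multi-label classification: sigmoid per logit, keep labels above 0.5.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer(["I love this!"], return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

labels = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p > 0.5]
print(labels)
```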
We always welcome contributions to matrix! Please refer to the Contribution Guidelines to learn how to format, test, and submit your work.
If you use matrix in your research and wish to refer to it, please use the following BibTeX entry.
```bibtex
@software{matrix2025,
  author = {Dong Wang and Yang Li and Ansong Ni and Youssef Emad and Xinjie Lei and Ruta Desai and Asli Celikyilmaz and Ramya Raghavendra and Leo Huang and Daniel Li},
  title = {Matrix: Multi-Agent daTa geneRation Infra and eXperimentation},
  url = {http://github.com/facebookresearch/matrix},
  year = {2025},
}
```
This project is MIT licensed, as found in the LICENSE file.
We gratefully acknowledge the Ray and vLLM teams for the initial Ray Serve integration with vLLM.