Hermes is a public, open-source evaluation framework implementing the methodology described in the paper cited below. It is an algorithm-system co-design framework that intelligently distributes search clusters across multiple machines, employing hierarchical search and Dynamic Voltage and Frequency Scaling (DVFS) to optimize retrieval latency and energy consumption. Hermes is built on open-source LLMs and retrieval indices from publicly available datasets, and runs on commodity hardware.
📖 Read our full paper: here
📈 Explore our profiled inference and retrieval data with RAGCAT: here
🔗 If you use Hermes or RAGCAT in your research, please cite us:
@inproceedings{shen2025hermes,
  title={Hermes: Algorithm-System Co-design for Efficient Retrieval-Augmented Generation At-Scale},
  author={Shen, Michael and Umar, Muhammad and Maeng, Kiwan and Suh, G Edward and Gupta, Udit},
  booktitle={Proceedings of the 52nd Annual International Symposium on Computer Architecture},
  pages={958--973},
  year={2025}
}
Hermes leverages publicly available datasets:
📂 Datasets
- SPHERE_899M – BERT Encoded 899M Subset of Common Crawl
- SPHERE_100M – BERT Encoded 100M Subset of Common Crawl
- SPHERE_100K – BERT Encoded 100K Subset of Common Crawl
- TriviaQA – Open-domain QA dataset
🧠 Models

📋 Workflow
- Environment Setup
- Create Monolithic, Split, and Hermes Cluster Search Indices
- Profile Search Latency and Power
- Profile Latency and Power of Encoding & Inference
- Generate Cluster Access Traces
- Run Multi-Node Aggregation Analysis
- Accuracy and DVFS Evaluation + Figures
🚀 Quick Scripts for Automated Building, Profiling, and Data Collection on Hermes:
- build.sh: Builds the Flat, Monolithic, Clustered, and Split retrieval indices on the 100K datastore
- profile.sh: Profiles latency and power of the 100K Monolithic, Clustered, and Split retrieval indices; also profiles SOTA encoder and inference model latency and power
- eval.sh: Models Hermes retrieval latency and energy usage and runs the Hermes accuracy analysis scripts
- isca_figures.sh: Uses the data provided in 100m_data to produce the figures in the ISCA paper
Artifact Evaluation Instructions
To reproduce the figures from our ISCA paper, please follow these steps:
- Set up the environment.
- Run build.sh to build the indices used for profiling.
- Execute isca_figures.sh to generate the figures from the paper.
The other workflows are intended for users who wish to build, profile, and evaluate their own indices.
You can set up the environment for evaluating Hermes either by building a Docker image or by installing the required packages directly on your native Linux system. Note: for DVFS and energy analysis, the Docker container must be run in privileged mode with write access to /sys.
- Pull Docker Image

  sudo docker pull michaeltshen/hermes-env:latest

- Run Docker Container with GPU Support

  sudo docker run --gpus all --privileged -v /sys:/sys -it michaeltshen/hermes-env:latest

- Pull the Latest Code from Git

  git pull

Alternatively, you can build the Docker image yourself with the included Dockerfile:
- Clone Repository

  git clone --recurse-submodules https://github.com/Michaeltshen/Hermes.git
  cd Hermes

- Download Encoded TriviaQA Queries

  triviaqa_encodings.npy → triviaqa/

- Build Docker Image (run from the base of the repo)

  sudo docker build -t hermes-env .

- Run Docker Container with GPU Support

  sudo docker run --gpus all --privileged -v /sys:/sys -it hermes-env
- Create a Conda Environment

  conda create -n hermes python=3.11
  conda activate hermes

- Clone Repository

  git clone --recurse-submodules https://github.com/Michaeltshen/Hermes.git
  cd Hermes

- Download Encoded TriviaQA Queries

  triviaqa_encodings.npy → triviaqa/

- Install Dependencies

  conda install -c pytorch -c nvidia faiss-gpu=1.8.0 pytorch=*=*cuda* pytorch-cuda=11 numpy
  conda install -c conda-forge gcc_linux-64 gxx_linux-64
  pip install transformers vllm datasets pynvml matplotlib pyRAPL pymongo

- Torchvision Dependency Corrections

  If you encounter CUDA version mismatches between PyTorch and torchvision, run the following command to automatically detect and install the correct torchvision version for your setup:

  source setup/torchvision_version_fix.sh

  This script ensures that torchvision matches your installed PyTorch version and CUDA compatibility, preventing runtime errors.

- Make RAPL Files Readable

  sudo chmod -R a+r /sys/class/powercap/intel-rapl/
  sudo chmod a+r /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj

- Build Rapl-Read

  cd uarch-configure/rapl-read
  make
Building large-scale indices (e.g., 1B–10B vectors) can take days or weeks. These indices can also reach hundreds of gigabytes in size. Choose an approach based on your requirements:
🔹 Monolithic Index
python index/create_monolithic_index.py --index-size 100K
🔹 Evenly Split Indices
python index/create_split_indices.py --dataset-size 100k --num-indices 10
🔹 Clustered Hermes Indices
python index/create_clustered_indices.py --dataset-size 100k --num-indices 10
🔹 Flat Index
python index/create_flat_index.py --index-size 100k
📌 Custom Datasets: Modify the dataset loading logic in the index creation files. Make sure your dataset is already encoded, or modify the script logic to encode each document before adding it to the index.
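If you bring your own pre-encoded dataset, index creation roughly amounts to training and populating a FAISS index from an embedding matrix. The sketch below is a minimal illustration of that flow, not the repo's script; the file name `my_embeddings.npy` and the `nlist` value are placeholder assumptions.

```python
import faiss
import numpy as np

# Hypothetical pre-encoded dataset: one 768-dim embedding per document.
embeddings = np.load("my_embeddings.npy").astype("float32")  # shape: (num_docs, 768)

dim = embeddings.shape[1]
nlist = 1024  # number of IVF cells; tune to your dataset size

# Build and train an IVF-Flat index, then add the pre-encoded vectors.
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(embeddings)
index.add(embeddings)

faiss.write_index(index, "my_custom_index.faiss")
```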
For benchmarking, create synthetic indices:
python index/synthetic_create_monolithic_index.py --index-size 1m --dim 768 --threads 32
Measure retrieval latency and energy performance of Hermes.
📌 Pre-profiled results available at: Here
Example Retrieval latency tests:
python measurements/latency/retrieval_monolithic_latency.py \
--index-name data/indices/monolithic_indices/hermes_index_monolithic_100k.faiss \
--nprobe 128 \
--batch-size 16 32 64 \
--retrieved-docs 5 10 \
--num-threads 32 \
--queries triviaqa/triviaqa_encodings.npy
python measurements/latency/retrieval_split_latency.py \
--index-folder data/indices/split_indices \
--nprobe 128 \
--batch-size 32 64 \
--retrieved-docs 10 20 \
--num-threads 32 \
--dataset-size 1000000 \
--queries triviaqa/triviaqa_encodings.npy
python measurements/latency/retrieval_hermes_clusters_latency.py \
--index-folder data/indices/hermes_clusters \
--nprobe 8 128 \
--batch-size 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 \
--retrieved-docs 5 \
--num-threads 32 \
--queries triviaqa/triviaqa_encodings.npy
python measurements/latency/retrieval_hermes_sample_deep_latency.py \
--index-folder data/indices/hermes_clusters \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 10 20 \
--num-threads 32 \
--queries triviaqa/triviaqa_encodings.npy
python measurements/power/retrieval_monolithic_power.py \
--index-name data/indices/monolithic_indices/hermes_index_monolithic_100k.faiss \
--nprobe 128 \
--batch-size 16 32 64 \
--retrieved-docs 5 10 \
--num-threads 32 \
--queries triviaqa/triviaqa_encodings.npy
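The retrieval_hermes_sample_deep_latency.py run above exercises Hermes's two-phase (sample-then-deep) cluster search. For intuition, here is a minimal sketch of that idea using FAISS; the per-cluster file layout, cluster count, and cluster-selection policy are simplified assumptions rather than the repo's exact logic.

```python
import faiss
import numpy as np

queries = np.load("triviaqa/triviaqa_encodings.npy").astype("float32")

# Assumed layout: one FAISS index per Hermes cluster (hypothetical file names).
cluster_paths = [f"data/indices/hermes_clusters/clusters/cluster_{i}.faiss" for i in range(10)]
clusters = [faiss.read_index(p) for p in cluster_paths]

sample_nprobe, deep_nprobe = 8, 128  # cheap first pass vs. expensive second pass
clusters_searched, k = 4, 5

# Phase 1: shallow "sample" search of every cluster to rank clusters per query.
sample_dists = []
for index in clusters:
    index.nprobe = sample_nprobe
    dists, _ = index.search(queries, k)
    sample_dists.append(dists[:, 0])           # best distance per query from this cluster
sample_dists = np.stack(sample_dists, axis=1)  # (num_queries, num_clusters)

# Phase 2: deep search only in the most promising clusters for each query.
best_clusters = np.argsort(sample_dists, axis=1)[:, :clusters_searched]
for qid, query in enumerate(queries):
    for cid in best_clusters[qid]:
        clusters[cid].nprobe = deep_nprobe
        dists, ids = clusters[cid].search(query[None, :], k)
        # ...merge (dists, ids) across the searched clusters to form the final top-k...
```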
⚡ Measure Latency and Power Usage for different encoding and inference models.
For a comprehensive list of supported inference models, please refer to the vLLM documentation.
Example Encoding and Inference latency tests:
python measurements/latency/encoding_latency.py \
--model-name BAAI/bge-large-en \ # Supports most encoder models on Hugging Face
--batch-size 16 32 \
--input-lengths 16 32 64 128
python measurements/power/encoding_power.py \
--model-name BAAI/bge-large-en \ # Supports most encoder models on Hugging Face
--batch-size 16 32 \
--input-lengths 16 32 64 128
python measurements/latency/inference_latency.py \
--model-name "google/gemma-2-9b" \ # Can support most inference models within vLLM documentation
--num-gpus 1 \
--batch-size 16 32 \
--input-lengths 32 128 512 \
--output-lengths 4 16 32
python measurements/power/inference_power.py \
--model-name "google/gemma-2-9b" \ # Can support most inference models within vLLM documentation
--num-gpus 1 \
--batch-size 16 32 \
--input-lengths 32 128 512 \
--output-lengths 4 16 32
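The scripts above handle the measurement for you; as a rough picture of what an encoder latency measurement involves, the sketch below times a Hugging Face encoder forward pass for one batch size and input length (a hand-rolled illustration on an assumed GPU setup, not the repo's implementation):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "BAAI/bge-large-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval().cuda()

batch_size, input_length = 32, 128
# Dummy batch: repeat a single word out to the requested input length (assumption).
text = " ".join(["hello"] * input_length)
batch = tokenizer([text] * batch_size, truncation=True, max_length=input_length,
                  padding="max_length", return_tensors="pt").to("cuda")

with torch.no_grad():
    for _ in range(5):          # warm-up iterations
        model(**batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):         # timed iterations
        model(**batch)
    torch.cuda.synchronize()

print(f"mean encoding latency: {(time.perf_counter() - start) / 20 * 1e3:.2f} ms")
```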
⚡ Latency Profiling at Different Frequencies
source measurements/dvfs/profile_dvfs_latency.sh \
--folder data/indices/hermes_clusters \ # File path to clustered indices folder
--queries triviaqa/triviaqa_encodings.npy \
--nprobe 128 \
--batch-size 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 \
--retrieved-docs 5 \
--num-threads 32
⚡ IVF Search at Different Frequencies
There is currently no script that automatically measures the power usage of cluster search retrieval at different frequencies, so follow the steps below to collect the power numbers by hand.
- Set the System Frequency

  Configure your system's frequency by running:

  bash measurements/dvfs/set_frequency.sh 3000000
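set_frequency.sh takes care of this step; as an illustration of the underlying mechanism (an assumption based on the standard Linux cpufreq sysfs interface, not necessarily what the script does internally), the per-core frequency range can be pinned like this, with root privileges:

```python
import glob

target_khz = "3000000"  # 3.0 GHz, matching the example above

# Pin every core's frequency range to the target (requires root).
for cpufreq_dir in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq"):
    for knob in ("scaling_min_freq", "scaling_max_freq"):
        with open(f"{cpufreq_dir}/{knob}", "w") as f:
            f.write(target_khz)
```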
- Run the Stress Test and Monitor Power Usage

  Open two terminal windows and proceed as follows.

  Terminal 1: Stress the System

  Run the stress script to initiate the IVF search. Wait for the "RUN" print statements to appear before starting the power measurements.

  python measurements/dvfs/stress_ivf.py \
  --index data/indices/monolithic_indices/hermes_index_monolithic_100k.faiss \ # Path to a monolithic index (or any single index) to profile IVF power on
  --nprobe 256 \ # Use a high nprobe value
  --num-threads 32 \ # Use a high thread count (ideally all threads) so the system is fully stressed
  --queries triviaqa/triviaqa_encodings.npy
Terminal 2: Monitor Power Consumption
Monitor the power usage with one of the following commands:
- Using perf stat:

  sudo perf stat -e power/energy-pkg/ -e power/energy-ram/ sleep 1

- Using Intel RAPL:

  sudo ./uarch-configure/rapl-read/rapl-read -s

Use perf list to see a full list of all collectable power metrics available to perf stat.
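If you prefer to script the reading instead, a minimal sketch using the same RAPL counter made readable during setup is to sample energy_uj twice and divide by the elapsed time (counter wrap-around is ignored here):

```python
import time

RAPL_FILE = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

def read_energy_uj():
    # Cumulative package energy in microjoules.
    with open(RAPL_FILE) as f:
        return int(f.read())

# Average package power over a 1-second window while the stress test runs.
e0, t0 = read_energy_uj(), time.time()
time.sleep(1.0)
e1, t1 = read_energy_uj(), time.time()
print(f"package power: {(e1 - e0) / 1e6 / (t1 - t0):.2f} W")
```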
This tool models and aggregates data across multiple nodes for system performance optimization.
📌 Includes:
- Trace Generator – Generate cluster access traces
- Multi-Node Aggregation – Analyze RAG inference latency and energy usage
Generate the cluster access traces
python modeling/trace_generator.py
python modeling/latency_sim.py \
--latency-data data/profiling/hermes_cluster_latency.csv \ # File path to profiled dvfs latency data on clustered indices
--query-trace data/modeling/cluster_trace.csv \ # File path to cluster access trace generated
--retrieved-docs 5 \
--batch-size 32 64 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--num-threads 32
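Conceptually, the aggregation step treats the clusters a query touches as being searched in parallel on different nodes, so the query's retrieval time is bounded by the slowest of those nodes. The toy sketch below illustrates that idea only; the lookup-table format and numbers are made up, and the repo's simulator is more detailed.

```python
# Hypothetical profiled latencies: cluster_latency_ms[cluster_id][batch_size] (made-up values).
cluster_latency_ms = {0: {32: 4.1}, 1: {32: 5.3}, 2: {32: 3.8}, 3: {32: 6.0}}

def query_retrieval_latency(clusters_hit, batch_size=32):
    # Clusters are searched in parallel across nodes, so the query finishes
    # when the slowest node does (simplified aggregation model).
    return max(cluster_latency_ms[c][batch_size] for c in clusters_hit)

# Example trace entry: one query that accesses clusters 0, 2, and 3.
print(query_retrieval_latency([0, 2, 3]))  # -> 6.0
```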
python measurements/accuracy/evaluate_retrieval_accuracy.py \
--flat-index data/indices/flat_indices/hermes_index_flat_100k.faiss \ # File path to flat index
--monolithic-index data/indices/monolithic_indices/hermes_index_monolithic_100k.faiss \ # File path to monolithic index
--split-index-folder data/indices/split_indices \ # File path to folder where split indices are
--split-index-size 100000 \ # How large of a dataset the indices are built on
--cluster-index-folder data/indices/hermes_clusters/clusters \ # File path to clustered indices folder
--cluster-index-indices-folder data/indices/hermes_clusters/cluster_indices \ # File path to the folder of cluster index IDs
--monolithic-nprobe 256 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--queries triviaqa/triviaqa_encodings.npy
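For intuition about what the accuracy evaluation compares, a standard way to score an approximate index is recall@k against the exact results from the flat index. The sketch below shows that comparison for the monolithic index only; it is a simplified stand-in, not the repo's full metric set (which also covers the split and clustered indices and NDCG).

```python
import faiss
import numpy as np

queries = np.load("triviaqa/triviaqa_encodings.npy").astype("float32")
k = 5

flat = faiss.read_index("data/indices/flat_indices/hermes_index_flat_100k.faiss")
ivf = faiss.read_index("data/indices/monolithic_indices/hermes_index_monolithic_100k.faiss")
ivf.nprobe = 256

_, gt_ids = flat.search(queries, k)   # exact search -> ground-truth neighbors
_, ap_ids = ivf.search(queries, k)    # approximate search under test

# recall@k: fraction of exact neighbors that the approximate index also returned.
recall = np.mean([len(set(gt) & set(ap)) / k for gt, ap in zip(gt_ids, ap_ids)])
print(f"recall@{k}: {recall:.3f}")
```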
python modeling/dvfs_sim.py \
--latency-frequency-data data/profiling/hermes_frequency_cluster_latency.csv \ # File path to profiled dvfs latency data on clustered indices
--power-frequency-data data/profiling/hermes_frequency_cluster_power.csv \ # File path to profiled dvfs power data on IVF search
--query-trace data/modeling/cluster_trace.csv \ # File path to cluster access trace generated
--inference-trace data/profiling/inference_latency.csv \ # File path to profiled inference latencies
--retrieved-docs 5 \
--batch-size 32 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--num-threads 32 \
--input-size 512 \
--stride-length 16
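The DVFS analysis trades frequency against energy: for a given latency budget it wants the operating point that still meets the budget while consuming the least energy. The toy sketch below illustrates that selection only; the table format and numbers are invented, and the repo's dvfs_sim.py is more detailed.

```python
# Hypothetical profiled data per frequency (kHz): (search latency in ms, package power in W).
profiled = {
    1200000: (9.5, 35.0),
    2000000: (6.1, 55.0),
    3000000: (4.2, 80.0),
}

def pick_frequency(latency_budget_ms):
    # Among frequencies that meet the latency budget, choose the one with the
    # lowest energy per search (latency x power); fall back to the fastest otherwise.
    feasible = {f: lat * pwr for f, (lat, pwr) in profiled.items() if lat <= latency_budget_ms}
    return min(feasible, key=feasible.get) if feasible else max(profiled)

print(pick_frequency(7.0))  # -> 2000000 with the toy numbers above
```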
python figures/fig_06_rag_ttft_e2e_retrieval_overhead.py \
--input-size 512 \
--output-size 128 \
--stride-length 32 \
--batch-size 32 \
--monolithic-nprobe 256 \
--retrieved-docs 5 \
--monolithic-retrieval-trace data/profiling/retrieval_monolithic_latency.csv \ # File path to profiled monolithic retrieval index latency
--encoding-trace data/profiling/encoding_latency.csv \ # File path to profiled encoder model latency
--inference-trace data/profiling/inference_latency.csv # File path to profiled inference latencies
python figures/fig_11_hermes_accuracy_comparison.py \
--data-file data/accuracy_eval.csv \ # File path to output file produced from accuracy evaluation
--sample-nprobe 8 \
--deep-nprobe 128 \
--monolithic-nprobe 256
python figures/fig_12_hermes_nprobe_dse_ndcg.py \
--data-file data/accuracy_eval.csv \ # File path to output file produced from accuracy evaluation
--sample-nprobe 8 \
--deep-nprobe 128
python figures/fig_12_hermes_nprobe_dse_latency.py \
--data-file data/profiling/hermes_sample_deep_latency.csv \ # File path to output file produced from the hermes sample/deep latency script
--sample-nprobe 8 \
--deep-nprobe 128
python figures/fig_13_cluster_size_frequency_analysis.py \
--index-folder data/indices/hermes_clusters/clusters \ # File path to clustered indices folder
--cluster-access-trace data/modeling/cluster_trace.csv \ # File path to cluster access trace generated
--clusters-searched 5
python figures/fig_14_end_to_end_hermes_latency_comparison.py \
--input-size 512 \
--output-size 128 \
--stride-length 16 \
--batch-size 32 \
--monolithic-nprobe 256 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--clusters-searched 4 \
--monolithic-retrieval-trace data/profiling/retrieval_monolithic_latency.csv \ # File path to profiled monolithic retrieval index latency
--hermes-retrieval-trace data/modeling/hermes_retrieval.csv \ # File path to hermes multi node aggregation retrieval latency output
--encoding-trace data/profiling/encoding_latency.csv \ # File path to profiled encoder model latency
--inference-trace data/profiling/inference_latency.csv # File path to profiled inference latencies
python figures/fig_14_end_to_end_hermes_energy_comparison.py \
--input-size 512 \
--output-size 128 \
--stride-length 16 \
--batch-size 1 \
--monolithic-nprobe 256 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--clusters-searched 4 \
--hermes-retrieval-trace 100m_data/hermes_platinum_8380_100m_modeled_retrieval_energy.csv \ # File path to hermes multi node aggregation retrieval latency output
--monolithic-retrieval-trace 100m_data/monolithic_retrieval_latency.csv \ # File path to profiled monolithic retrieval index latency
--encoding-trace 100m_data/bge_large_latency.csv \ # File path to profiled encoder model latency
--inference-trace 100m_data/gemma_2_9b_latency.csv \ # File path to profiled inference latencies
--monolithic-retrieval-trace-power 100m_data/monolithic_retrieval_power.csv \ # File path to profiled monolithic retrieval index power
--encoding-trace-power 100m_data/bge_large_power.csv \ # File path to profiled encoder model power
--inference-trace-power 100m_data/gemma_2_9b_power.csv # File path to profiled inference power
python figures/fig_16_ttft_hermes_latency_comparison.py \
--input-size 512 \
--stride-length 16 \
--batch-size 32 \
--monolithic-nprobe 256 \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--monolithic-retrieval-trace data/profiling/retrieval_monolithic_latency.csv \ # File path to profiled monolithic retrieval index latency
--hermes-retrieval-trace data/modeling/hermes_retrieval.csv \ # File path to hermes multi node aggregation retrieval latency output
--encoding-trace data/profiling/encoding_latency.csv \ # File path to profiled encoder model latency
--inference-trace data/profiling/inference_latency.csv # File path to profiled inference latencies
python figures/fig_18_hermes_energy_throuhgput_analysis.py \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--batch-size 32 \
--clusters-searched 4 \
--hermes-retrieval-trace data/modeling/hermes_retrieval.csv \ # File path to hermes multi node aggregation retrieval latency output
--hermes-energy-trace data/modeling/hermes_retrieval_energy.csv # File path to hermes DVFS energy analysis output
python figures/fig_20_hermes_diff_hardware_comparison.py \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--batch-size 32 \
--hermes-retrieval-traces data/modeling/hermes_retrieval.csv # File path(s) to hermes multi-node aggregation retrieval latency outputs; multiple trace files can be provided
python figures/fig_21_hermes_dvfs_analysis.py \
--sample-nprobe 8 \
--deep-nprobe 128 \
--retrieved-docs 5 \
--batch-size 32 \
--data-file data/modeling/hermes_retrieval_energy.csv # File path to hermes DVFS energy analysis output
This project is licensed under the MIT License. See the LICENSE file for full details.