This repository contains the code and data for the paper "CHESS: Contextual Harnessing for Efficient SQL Synthesis."
Translating natural language questions into SQL queries, known as text-to-SQL, is a long-standing research problem. Effective text-to-SQL synthesis can be very challenging due to:
- (i) the extensive size of database catalogs (descriptions of tables and their columns) and database values,
- (ii) the need to reason over large database schemas,
- (iii) the difficulty of ensuring the functional validity of generated queries, and
- (iv) the ambiguity of natural language questions.
We introduce CHESS, a Large Language Model (LLM)-based multi-agent framework for efficient and scalable SQL synthesis, comprising four specialized agents, each targeting one of the aforementioned challenges (a conceptual sketch of the pipeline follows the list):
- Information Retriever (IR): Extracts relevant data.
- Schema Selector (SS): Prunes large schemas.
- Candidate Generator (CG): Generates high-quality candidates and refines queries iteratively.
- Unit Tester (UT): Validates queries through LLM-based natural language unit tests.
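To make the division of labor concrete, the sketch below shows how the four agents could be composed into a single pipeline. The class and function names are illustrative stand-ins, not the repository's actual API; each stub marks where the corresponding agent would do its work.

```python
# Illustrative sketch of the CHESS agent pipeline. Names and signatures
# are hypothetical; see the repository source for the real implementation.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    question: str
    schema: Dict[str, List[str]]  # table name -> column names

def information_retriever(task: Task) -> dict:
    """IR: gather relevant database values and catalog descriptions (stub)."""
    return {"values": [], "descriptions": []}

def schema_selector(task: Task, evidence: dict) -> Dict[str, List[str]]:
    """SS: prune the full schema to a manageable sub-schema (stub)."""
    return task.schema

def candidate_generator(task: Task, sub_schema: Dict[str, List[str]], evidence: dict) -> List[str]:
    """CG: draft candidate SQL queries and revise them iteratively (stub)."""
    return ["SELECT COUNT(*) FROM users"]

def unit_tester(task: Task, candidates: List[str]) -> str:
    """UT: rank candidates with LLM-generated natural-language unit tests (stub)."""
    return candidates[0]

def run_pipeline(task: Task) -> str:
    evidence = information_retriever(task)
    sub_schema = schema_selector(task, evidence)
    candidates = candidate_generator(task, sub_schema, evidence)
    return unit_tester(task, candidates)

print(run_pipeline(Task("How many users are there?", {"users": ["id", "signup_date"]})))
```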
Our framework offers configurable features that adapt to various deployment constraints:
- Industrial-Scale Database Support: Using the Schema Selector agent, CHESS efficiently narrows down very large database schemas into manageable sub-schemas, boosting system accuracy by approximately 2% and reducing LLM token usage by 5x.
- Privacy-Preserving Performance: Among methods using open-source models, CHESS achieves state-of-the-art performance, providing a high-performing, privacy-preserving system suitable for industrial deployment.
- Scalability: In settings with high computational budgets, CHESS reaches 71.10% accuracy on the BIRD test set, within 2% of the leading proprietary method, while reducing LLM calls by approximately 83%.
To set up the environment:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/CHESS.git
  cd CHESS
  ```
- Create a `.env` file in the root directory and add the following configuration:

  ```bash
  DATA_MODE="dev"
  DATA_PATH="./data/dev/dev.json"
  DB_ROOT_DIRECTORY="./data/dev/dev_databases"
  DATA_TABLES_PATH="./data/dev/dev_tables.json"
  INDEX_SERVER_HOST='localhost'
  INDEX_SERVER_PORT=12345

  OPENAI_API_KEY=
  GCP_PROJECT=''
  GCP_REGION='us-central1'
  GCP_CREDENTIALS=''
  GOOGLE_CLOUD_PROJECT=''
  ```
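If you need to read these settings from your own scripts, a minimal sketch using the `python-dotenv` package looks like this (the repository's own configuration loading may differ):

```python
# Minimal sketch of reading the .env settings; assumes python-dotenv is
# installed. The repository's own configuration loading may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

data_mode = os.getenv("DATA_MODE", "dev")
db_root = os.getenv("DB_ROOT_DIRECTORY", "./data/dev/dev_databases")
openai_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if unset
```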
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
To retrieve database catalogs and find the most similar database values to a question, preprocess the databases:
- Run the preprocessing script:

  ```bash
  sh run/run_preprocess.sh
  ```

  This creates the MinHash, LSH, and vector databases for each database in the specified directory.
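Conceptually, the value index built here lets the Information Retriever match keywords from a question against database values by approximate set similarity. The sketch below illustrates the MinHash/LSH idea with the `datasketch` package and made-up values; it is not the repository's preprocessing code.

```python
# Illustration of MinHash/LSH value retrieval using the datasketch package.
# The values and threshold are made up; this is not the repository's code.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

# Index a few hypothetical database values.
values = ["United States of America", "United Kingdom", "California"]
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for i, value in enumerate(values):
    lsh.insert(f"value-{i}", minhash_of(value))

# Query with a keyword phrase taken from a natural language question.
print(lsh.query(minhash_of("united states")))  # -> keys of similar values
```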
After preprocessing the databases, generate SQL queries for the BIRD dataset by choosing a configuration:
- Run the main script with one of the provided configurations:

  ```bash
  sh run/run_main_ir_cg_ut.sh
  ```

  or

  ```bash
  sh run/run_main_ir_ss_ch.sh
  ```
The sub-sampled development set (SDS) is a subset of the BIRD development set containing 10% of the samples from each database. It is used for ablation studies and is available in `sub_sampled_bird_dev_set.json`.
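To get a quick feel for the subset, you can inspect it as below. This assumes each entry carries a `db_id` field, as in the standard BIRD JSON format; adjust the field name if it differs.

```python
# Quick inspection of the sub-sampled dev set. Assumes each sample has a
# "db_id" field, as in the standard BIRD format; adjust if it differs.
import json
from collections import Counter

with open("sub_sampled_bird_dev_set.json") as f:
    samples = json.load(f)

print(len(samples), "samples")
print(Counter(s["db_id"] for s in samples).most_common(5))
```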
To use your own LLM, modify the `get_llm_chain(engine, temperature, base_uri=None)` function in `run/langchain_utils.py` and add your LLM there.
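As an illustration, an OpenAI-compatible endpoint could be wired in roughly as follows. This is a sketch assuming the `langchain-openai` package and a hypothetical engine name; the actual branching logic in `run/langchain_utils.py` will look different.

```python
# Hypothetical sketch of extending get_llm_chain with a custom engine,
# assuming the langchain-openai package; the real function differs.
from langchain_openai import ChatOpenAI

def get_llm_chain(engine, temperature, base_uri=None):
    if engine == "my-local-llm":  # hypothetical engine name
        # Works with any OpenAI-compatible server (e.g., a local endpoint).
        return ChatOpenAI(
            model="my-local-llm",
            temperature=temperature,
            base_url=base_uri or "http://localhost:8000/v1",
            api_key="not-needed",  # many local servers ignore the key
        )
    raise ValueError(f"Unknown engine: {engine}")
```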
If you find this repository helpful, please cite the following paper:
```bibtex
@article{talaei2024chess,
  title={CHESS: Contextual Harnessing for Efficient SQL Synthesis},
  author={Talaei, Shayan and Pourreza, Mohammadreza and Chang, Yu-Chen and Mirhoseini, Azalia and Saberi, Amin},
  journal={arXiv preprint arXiv:2405.16755},
  year={2024}
}
```