This repository contains the code and data for the paper "CHESS: Contextual Harnessing for Efficient SQL Synthesis."
Translating natural language questions into SQL queries, known as text-to-SQL, is a long-standing research problem. Effective text-to-SQL synthesis can be very challenging due to:
- (i) the extensive size of database catalogs (descriptions of tables and their columns) and database values,
- (ii) the need to reason over large database schemas,
- (iii) the difficulty of ensuring the functional validity of generated queries, and
- (iv) the ambiguity of natural language questions.
We introduce CHESS, a Large Language Model (LLM)-based multi-agent framework for efficient and scalable SQL synthesis, comprising four specialized agents, each targeting one of the aforementioned challenges (a conceptual sketch of the pipeline follows the list):
- Information Retriever (IR): Extracts relevant data.
- Schema Selector (SS): Prunes large schemas.
- Candidate Generator (CG): Generates high-quality candidates and refines queries iteratively.
- Unit Tester (UT): Validates queries through LLM-based natural language unit tests.
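To make the division of labor concrete, the sketch below shows how the four agents could be composed into a single pipeline. The class and function names are illustrative stand-ins, not the repository's actual API; each stub marks where the corresponding agent would do its work.

```python
# Illustrative sketch of the CHESS agent pipeline. Names and signatures
# are hypothetical; see the repository source for the real implementation.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    question: str
    schema: Dict[str, List[str]]  # table name -> column names

def information_retriever(task: Task) -> dict:
    """IR: gather relevant database values and catalog descriptions (stub)."""
    return {"values": [], "descriptions": []}

def schema_selector(task: Task, evidence: dict) -> Dict[str, List[str]]:
    """SS: prune the full schema to a manageable sub-schema (stub)."""
    return task.schema

def candidate_generator(task: Task, sub_schema: Dict[str, List[str]], evidence: dict) -> List[str]:
    """CG: draft candidate SQL queries and revise them iteratively (stub)."""
    return ["SELECT COUNT(*) FROM users"]

def unit_tester(task: Task, candidates: List[str]) -> str:
    """UT: rank candidates with LLM-generated natural-language unit tests (stub)."""
    return candidates[0]

def run_pipeline(task: Task) -> str:
    evidence = information_retriever(task)
    sub_schema = schema_selector(task, evidence)
    candidates = candidate_generator(task, sub_schema, evidence)
    return unit_tester(task, candidates)

print(run_pipeline(Task("How many users are there?", {"users": ["id", "signup_date"]})))
```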
Our framework offers configurable features that adapt to various deployment constraints:
- Industrial-Scale Database Support: Using the Schema Selector agent, CHESS efficiently narrows down very large database schemas into manageable sub-schemas, boosting system accuracy by approximately 2% and reducing LLM token usage by 5x.
- Privacy-Preserving Performance: Among methods using open-source models, CHESS achieves state-of-the-art performance, providing a high-performing, privacy-preserving system suitable for industrial deployment.
- Scalability: In settings with high computational budgets, CHESS reaches 71.10% accuracy on the BIRD test set, within 2% of the leading proprietary method, while reducing LLM calls by approximately 83%.
To set up the environment:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/CHESS.git
  cd CHESS
  ```
- Create a `.env` file in the root directory and add the following configuration:

  ```bash
  DATA_MODE="dev"
  DATA_PATH="./data/dev/dev.json"
  DB_ROOT_DIRECTORY="./data/dev/dev_databases"
  DATA_TABLES_PATH="./data/dev/dev_tables.json"
  INDEX_SERVER_HOST='localhost'
  INDEX_SERVER_PORT=12345

  OPENAI_API_KEY=
  GCP_PROJECT=''
  GCP_REGION='us-central1'
  GCP_CREDENTIALS=''
  GOOGLE_CLOUD_PROJECT=''
  ```
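If you need to read these settings from your own scripts, a minimal sketch using the `python-dotenv` package looks like this (the repository's own configuration loading may differ):

```python
# Minimal sketch of reading the .env settings; assumes python-dotenv is
# installed. The repository's own configuration loading may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

data_mode = os.getenv("DATA_MODE", "dev")
db_root = os.getenv("DB_ROOT_DIRECTORY", "./data/dev/dev_databases")
openai_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if unset
```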
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
To retrieve database catalogs and find the most similar database values to a question, preprocess the databases:
- Run the preprocessing script:

  ```bash
  sh run/run_preprocess.sh
  ```

  This creates the MinHash, LSH, and vector databases for each database in the specified directory.
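Conceptually, the value index built here lets the Information Retriever match keywords from a question against database values by approximate set similarity. The sketch below illustrates the MinHash/LSH idea with the `datasketch` package and made-up values; it is not the repository's preprocessing code.

```python
# Illustration of MinHash/LSH value retrieval using the datasketch package.
# The values and threshold are made up; this is not the repository's code.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

# Index a few hypothetical database values.
values = ["United States of America", "United Kingdom", "California"]
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for i, value in enumerate(values):
    lsh.insert(f"value-{i}", minhash_of(value))

# Query with a keyword phrase taken from a natural language question.
print(lsh.query(minhash_of("united states")))  # -> keys of similar values
```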
After preprocessing the databases, generate SQL queries for the BIRD dataset by choosing a configuration:
- Run the main script with one of the provided configurations:

  ```bash
  sh run/run_main_ir_cg_ut.sh
  ```

  or

  ```bash
  sh run/run_main_ir_ss_ch.sh
  ```
The sub-sampled development set (SDS) is a subset of the BIRD development set containing 10% of the samples from each database. It is used for ablation studies and is available in `sub_sampled_bird_dev_set.json`.
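To get a quick feel for the subset, you can inspect it as below. This assumes each entry carries a `db_id` field, as in the standard BIRD JSON format; adjust the field name if it differs.

```python
# Quick inspection of the sub-sampled dev set. Assumes each sample has a
# "db_id" field, as in the standard BIRD format; adjust if it differs.
import json
from collections import Counter

with open("sub_sampled_bird_dev_set.json") as f:
    samples = json.load(f)

print(len(samples), "samples")
print(Counter(s["db_id"] for s in samples).most_common(5))
```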
To use your own LLM, modify the `get_llm_chain(engine, temperature, base_uri=None)` function in `run/langchain_utils.py` and add your LLM there.
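As an illustration, an OpenAI-compatible endpoint could be wired in roughly as follows. This is a sketch assuming the `langchain-openai` package and a hypothetical engine name; the actual branching logic in `run/langchain_utils.py` will look different.

```python
# Hypothetical sketch of extending get_llm_chain with a custom engine,
# assuming the langchain-openai package; the real function differs.
from langchain_openai import ChatOpenAI

def get_llm_chain(engine, temperature, base_uri=None):
    if engine == "my-local-llm":  # hypothetical engine name
        # Works with any OpenAI-compatible server (e.g., a local endpoint).
        return ChatOpenAI(
            model="my-local-llm",
            temperature=temperature,
            base_url=base_uri or "http://localhost:8000/v1",
            api_key="not-needed",  # many local servers ignore the key
        )
    raise ValueError(f"Unknown engine: {engine}")
```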
If you find this repository helpful, please cite the following paper:
```bibtex
@article{talaei2024chess,
  title={CHESS: Contextual Harnessing for Efficient SQL Synthesis},
  author={Talaei, Shayan and Pourreza, Mohammadreza and Chang, Yu-Chen and Mirhoseini, Azalia and Saberi, Amin},
  journal={arXiv preprint arXiv:2405.16755},
  year={2024}
}
```