ESM+ is a new metric for the Text-to-SQL task. ESM+ computes semantic accuracy with a lower false-positive rate than Execution Accuracy and a lower false-negative rate than Exact Set Matching. It is released along with our baselines, as well as the outputs of several other state-of-the-art models. This repo contains all the code necessary for evaluation.
`ESMp.py` and `esmp_process_sql.py` are written in Python 3.10 and are modeled after test-suite-sql-eval.
Just like in the original evaluation scripts, running this evaluation requires gold and predicted txt files. Examples of these are linked in `spider_dev`, `spider_test`, and `cosql_dev`. Each of these folders contains:

- `gold.txt`: gold file where each line is `gold SQL \t db_id`
- `GPT4Turbo.txt`: GPT4Turbo baseline predictions
- `Claude.txt`: Claude3Opus baseline predictions
- `C3.txt`: C3 model predictions
- `DAIL.txt`: DAIL model predictions
- `DIN.txt`: DIN model predictions
- `RASAT+PICARD.txt`: RASAT+PICARD predictions
- `RESDSQL.txt`: RESDSQL predictions
- `Graphix.txt`: Graphix predictions
- `STAR.txt`: STAR predictions
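As a minimal sketch of the file format above, a gold line can be split on the tab separating the SQL from the database id (prediction files carry only the SQL). The example query and db_id here are illustrative, not taken from the released files:

```python
def parse_gold_line(line):
    """Split one gold-file line of the form 'gold SQL \t db_id'."""
    sql, db_id = line.rstrip("\n").split("\t")
    return sql, db_id

# Hypothetical example line in the documented format:
sql, db_id = parse_gold_line("SELECT count(*) FROM singer\tconcert_singer")
```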
For the dev sets, predictions are taken directly from the corresponding GitHub repositories, with the exception of RASAT+PICARD, which we reproduced. For spider_test, the predictions were reproduced using the same process as the original work, so results may differ slightly.
First, download the database folders for spider (dev and test) and cosql (only dev). Save the database folders into spider_dev, spider_test, and cosql_dev, respectively.
Then, create a conda environment:
conda create -n "ESMp" python=3.10.0
conda activate ESMp
Install packages:
pip install -r requirements.txt
To run our script, use the following command:
python3 ESMp.py --gold path/to/gold.txt --pred path/to/pred.txt --db path/to/database/ --table path/to/tables.json
- `--gold`: gold txt file.
- `--pred`: predictions txt file.
- `--db`: directory of databases.
- `--table`: tables json file.
- `--etype`: same as in the original evaluation scripts. Note that exec has been updated according to the paper. Default is match (ESM+).
- `--plug_value`: same as in the original evaluation scripts. Note that this metric is designed for models that do predict values.
- `--progress_bar_for_each_datapoint`: same as in the original evaluation scripts.
- `--disable_value`: add to disable value checks; strongly discouraged.
- `--disable_distinct`: add to disable DISTINCT checks; strongly discouraged.
- `--disable_rules`: takes a comma-separated list of rule numbers, `none`, or `all`. Rule numbers correspond to those in Table 1 of our paper. Default is `none`.
- `--verbose`: add to print information such as which rules are applied on each comparison.
The default configuration runs ESM+ on Spider's test set with our baseline GPT4Turbo predictions.
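As a sketch of how a `--disable_rules` value such as `1,3` could be interpreted, the following parses the three accepted forms. The total rule count here is a placeholder, not the actual number of rules in Table 1 of the paper:

```python
# Placeholder rule set; the real rule numbers are those in Table 1.
ALL_RULES = set(range(1, 10))

def parse_disable_rules(value):
    """Turn 'none', 'all', or a comma-separated list into a set of rule ids."""
    if value == "none":
        return set()
    if value == "all":
        return set(ALL_RULES)
    return {int(r) for r in value.split(",")}
```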
We introduced two new baselines. These are stored in the baselines folder.
To begin, save the spider and cosql datasets into baselines/.
To run, first put your LLM keys in llm.py.
Then install requirements:
pip install -r requirements.txt
Then, baselines can be run using:
python3 spider.py
python3 cosql.py