This is the official GitHub repository for Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents.
Files in the `./agents/evaluation/database` directory:

- `{test|train|validation}_ref_info.jsonl`: reference information used for scoring.
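Each line of a `*_ref_info.jsonl` file is one standalone JSON record. The exact schema is not documented here, so the field names below (`idx`, `query`) are illustrative assumptions; a minimal sketch of parsing such a file:

```python
import json

# Minimal sketch of reading JSONL (one JSON object per line).
# The inline sample stands in for a *_ref_info.jsonl file; the
# "idx" and "query" fields are assumed, not the actual schema.
sample = '{"idx": 0, "query": "3-day trip"}\n{"idx": 1, "query": "5-day trip"}\n'

records = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(records))  # 2
```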
This script generates evaluation datasets by removing constraints from the original TravelPlanner dataset.
There are two main generation strategies:
- Removing single constraints
- Removing pairs of constraints (global-local, local-global combinations)
```python
from dataset_generate import generate_dataset

# Generate test datasets with single constraints removed
generate_dataset(
    save_path="./evaluation/database",
    single_constraints=['budget', 'house rule', 'room type', 'cuisine']
)

# Generate test datasets with constraint pairs removed
generate_dataset(
    save_path="./evaluation/database",
    global_constraints=['budget', 'people_number'],
    local_constraints=['house rule', 'room type', 'cuisine']
)
```
Datasets generated from the validation set are available in the `./agents/evaluation/database` directory:

- `val_dataset_without_{constraint_type}_one.json`: datasets for two-turn evaluation (single constraint removed).
- `val_dataset_without_two_constraints_{constraint_combination}.json`: datasets for three-turn evaluation (pairs of constraints removed).
- `./preference/val_dataset_full_{budget_size}_budget.json`: datasets for priority-aware evaluation.
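The `{constraint_type}` placeholder corresponds to one of the single constraints passed to `generate_dataset`. As a rough sketch, the filenames for the two-turn setting could be derived like this; the assumption that spaces in constraint names map to underscores is mine, not confirmed by the repository:

```python
# Sketch: derive two-turn dataset filenames from the constraint list
# used in the single_constraints example above. Replacing spaces with
# underscores is an assumption about the naming convention.
constraints = ["budget", "house rule", "room type", "cuisine"]
filenames = [
    f"val_dataset_without_{c.replace(' ', '_')}_one.json" for c in constraints
]
print(filenames[0])  # val_dataset_without_budget_one.json
```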
After generating the evaluation datasets, use `evaluate.py` to run evaluations in the different modes below.
Evaluates the model's performance when a single constraint is removed.
```bash
python evaluate.py --mode single_constraint \
    --constraints "budget,room type" \
    --output_dir "./results/two_turn"
```
Evaluates how the model handles cases where pairs of constraints are removed.
```bash
python evaluate.py --mode two_constraints \
    --constraint_pairs "global_local,local_global" \
    --output_dir "./results/three_turn"
```
Evaluates the model on all examples with specified difficulty levels.
```bash
# Exclude easy examples
python evaluate.py --mode all_at_once \
    --difficulty not_easy \
    --output_dir "./results/one_turn"
```
Evaluates the model's handling of preference-based constraints across different budget types.
```bash
python evaluate.py --mode preference \
    --budget_types "high,middle,small" \
    --preference_types "cuisine,rating" \
    --output_dir "./results/preference"
```
You can adjust the evaluation mode by setting the `history` option in the `.config/test.yaml` file:

- `1`: keeps track of all previous logs interactively.
- `0`: provides a summary of the history instead of storing the full log.
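For reference, the relevant fragment of `.config/test.yaml` might look like the sketch below; any keys other than `history` are omitted, and the surrounding structure is an assumption:

```yaml
# .config/test.yaml (sketch; other keys omitted)
history: 1  # 1 = keep the full interactive log, 0 = summarize history
```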
TBD