This repository is the official implementation of WWW 2025 Paper "Compress and Mix: Advancing Efficient Taxonomy Completion with Large Language Models". It includes:
- COMI Reproduction: Easily reproduce the COMI model as described in our paper.
- Backbone Models: Includes base backbone models such as PromptLM and PretrainLM to facilitate TC research.
- Open-Sourced Compressed Tokens: We provide compressed tokens to support further research and inspire more efficient and effective TC techniques.
Please cite the paper if you find the code helpful. Thanks!
To reproduce the experiments, ensure your environment matches the required specifications.
Both `environment.yaml` and `requirements.txt` specify all necessary packages and versions. The dependencies can be set up using either conda or pip:
```shell
# conda
conda env create -f environment.yaml

# pip
pip install -r requirements.txt
```
All hyper-parameters and training settings are defined in config files. To modify these settings, edit the appropriate config file.
Details of all configuration parameters are provided in `./config_files/config.explain.json`; refer to this file to understand and adjust training settings as needed.
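Config files can be inspected and edited programmatically as well as by hand. The sketch below shows the general pattern, assuming nothing beyond standard JSON; the parameter names (`learning_rate`, `batch_size`) are illustrative placeholders, not the repo's actual keys — consult `./config_files/config.explain.json` for those.

```python
import json
import os
import tempfile

# Hypothetical config; the real parameter names are documented in
# ./config_files/config.explain.json.
cfg = {"learning_rate": 1e-4, "batch_size": 32}
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)

# Load the config, override one setting, and write it back —
# the same effect as editing the file in a text editor.
with open(path) as f:
    loaded = json.load(f)
loaded["batch_size"] = 16
with open(path, "w") as f:
    json.dump(loaded, f, indent=2)

with open(path) as f:
    print(json.load(f)["batch_size"])
```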
Follow the steps below to train the COMI model.
When running the program for the first time, the necessary data preprocessing will take some time. The essential intermediate files are then stored in pickle format. For subsequent runs, simply set the `raw` parameter to `False` and `existing_partition` to `True` in the `MAGDataset` within the `Dataloader` to load the intermediate files and avoid repeated processing. Alternatively, you can download the datasets and intermediate files directly from here and put them under `data/` for the experiments.
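The caching behaviour described above follows a common pattern: run the expensive preprocessing once, pickle the result, and load the pickle on later runs. A minimal sketch of that pattern, with `CACHE_PATH` and `preprocess()` as illustrative stand-ins rather than the repo's actual names:

```python
import os
import pickle

# Illustrative names, not the repository's actual API.
CACHE_PATH = "partition.pickle"

def preprocess():
    # Stand-in for the real (slow) data preprocessing step.
    return {"nodes": [0, 1, 2], "edges": [(0, 1), (1, 2)]}

def load_partition(raw=True):
    """Mimics raw=False / existing_partition=True: reuse the pickle if present."""
    if not raw and os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)  # load existing intermediate files
    data = preprocess()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(data, f)  # store intermediate files for future runs
    return data

first = load_partition(raw=True)    # preprocess and cache
second = load_partition(raw=False)  # load from the pickle
```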
To train the model from scratch and generate compressed tokens, use the following command to perform Semantic Compression:

```shell
python train_id.py --config './config_files/<TAXO_NAME>/config.SemanticCompression.json'
```
Note: Since this stage requires substantial GPU resources, you can directly use our compressed tokens in the `./compressed_token/` directory instead.
With the precomputed compressed tokens provided in `./compressed_token/`, you can proceed directly to Contrastive Structure Modeling using the following command:

```shell
python train_id.py --config './config_files/<TAXO_NAME>/config.StructureContrastive.json'
```
Replace `<TAXO_NAME>` with the name of the taxonomy corresponding to your dataset. Available options include `food`, `mesh`, and `SemEval-V`.
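To run both stages over every taxonomy in one go, a small driver loop such as the following can be used. It is a sketch that assumes the config paths follow the naming pattern shown above; the `echo` makes it a dry run that prints each command — drop `echo` to actually execute them.

```shell
# Dry-run driver: print the Semantic Compression and Contrastive
# Structure Modeling commands for each available taxonomy.
for TAXO_NAME in food mesh SemEval-V; do
  echo python train_id.py --config "./config_files/${TAXO_NAME}/config.SemanticCompression.json"
  echo python train_id.py --config "./config_files/${TAXO_NAME}/config.StructureContrastive.json"
done
```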