EXGRA-MED: Extended Context Graph Alignment for Medical Vision-Language Models

ArXiv | Hugging Face | License: CC BY-NC 4.0 | Website


Abstract

State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLAVA-MED and BIOMEDGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data.

To address this, we introduce EXGRA-MED, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization.


πŸ† Key Results

✅ Reveals the data inefficiency of autoregressive modeling: LLaVA-Med exhibits a significant performance drop when pre-trained on limited data, even after full fine-tuning on downstream tasks.

✅ Matches LLaVA-Med's performance on Medical VQA using only 10% of the pre-training data, demonstrating the data efficiency of EXGRA-MED.

✅ Surpasses several SOTA medical multi-modal LLMs when pre-trained on the full PMC-15M dataset (100%) with LLaMA-7B, across diverse tasks:

  • (i) Medical Visual Question Answering (VQA)
  • (ii) Medical Visual Chatbot
  • (iii) Zero-shot Image Classification (as a VQA task)



📣 News

  • [Jun 2025] 🔓 Initial codebase release (preprocessing + VQA fine-tuning).
  • [Jun 2025] 🧩 Checkpoints for EXGRA-MED + DCI and three VQA fine-tuned models now available.
  • [Jun 2025] 📊 Evaluation scripts and demo for the data-efficiency benchmark for VQA are online.
  • Coming Soon 🚧 Evaluation scripts for Medical Visual Chatbot and Zero-shot Image Classification.
  • Coming Soon 🚧 ExGra-Med checkpoints trained on large-scale data with 2.5M instruction-tuning samples from MedTrinity-25M (10%).

📦 Model Checkpoints

| Model | Description | 🤗 Download Link |
| --- | --- | --- |
| llava-med-10 | LLaVA-Med (10% pre-trained PMC-15M) | Link |
| llava-med-40 | LLaVA-Med (40% pre-trained PMC-15M) | Link |
| exgra-med-10 | ExGra-Med (10% pre-trained PMC-15M) | Link |
| exgra-med-40 | ExGra-Med (40% pre-trained PMC-15M) | Link |
| exgra-med | Our base EXGRA-MED model (100% pre-trained PMC-15M) | Link |
| exgra-med-dci | EXGRA-MED + DCI-enhanced version | Link |
| exgra-med-dci-vqa-rad | Fine-tuned on VQA-RAD | Link |
| exgra-med-dci-slake | Fine-tuned on SLAKE | Link |
| exgra-med-dci-pathvqa | Fine-tuned on PATH-VQA | Link |

Before starting fine-tuning, inference, or evaluation, download our fine-tuned checkpoints.

Download Checkpoints
cd pretrained/
# pip install -U huggingface_hub
# Download MERGE-Group/llava-med-10
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/llava-med-10 --local-dir llava-med-10

# Download MERGE-Group/llava-med-40
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/llava-med-40 --local-dir llava-med-40

# Download MERGE-Group/exgra-med-10
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-10 --local-dir exgra-med-10

# Download MERGE-Group/exgra-med-40
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-40 --local-dir exgra-med-40

# Download MERGE-Group/exgra-med
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med --local-dir exgra-med

# Download MERGE-Group/exgra-med-dci
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci --local-dir exgra-med-dci

# Download MERGE-Group/exgra-med-dci-vqa-rad
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-vqa-rad --local-dir exgra-med-dci-vqa-rad

# Download MERGE-Group/exgra-med-dci-slake
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-slake --local-dir exgra-med-dci-slake

# Download MERGE-Group/exgra-med-dci-pathvqa
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-pathvqa --local-dir exgra-med-dci-pathvqa
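
Alternatively, the checkpoints can be fetched from Python with the huggingface_hub library. The snippet below is a minimal sketch, assuming the MERGE-Group repository IDs listed in the table above and the pretrained/ target directory used by the commands:

# Minimal sketch: download all checkpoints programmatically
# (repository IDs taken from the table above; trim the list as needed)
from huggingface_hub import snapshot_download

repos = [
    "llava-med-10", "llava-med-40",
    "exgra-med-10", "exgra-med-40",
    "exgra-med", "exgra-med-dci",
    "exgra-med-dci-vqa-rad", "exgra-med-dci-slake", "exgra-med-dci-pathvqa",
]
for name in repos:
    # Each snapshot lands under pretrained/<name>, matching the CLI commands above
    snapshot_download(repo_id=f"MERGE-Group/{name}", local_dir=f"pretrained/{name}")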

πŸ› οΈ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch
  • CUDA driver

Note: Please check your CUDA driver version to install a compatible build of PyTorch. For instance, we provide an installation guide for CUDA 11 (using the cu117 wheels):

conda create -n exgra-med python=3.10
conda activate exgra-med
pip install --upgrade pip
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install openai==0.27.8
pip install git+https://github.com/huggingface/transformers@cae78c46
pip install -e .
pip install einops ninja open-clip-torch shortuuid nltk

Also, based on your CUDA driver, check the compatible version of FlashAttention 2 at this link, and then install the flash-attn package:

pip install flash-attn --no-build-isolation
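
Once everything is installed, a quick sanity check can confirm that PyTorch sees your GPU and that flash-attn imports correctly. This is a minimal sketch under the environment above, not part of the official setup:

# Sanity check: verify CUDA visibility and the flash-attn build
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # noqa: F401
    print("flash-attn:", flash_attn.__version__)
except ImportError as exc:
    print("flash-attn is not installed correctly:", exc)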

Project Structure

  • assets/: Contains various assets used by the project (e.g., images, supplementary files).
  • scripts/: Houses utility bash scripts.
  • exgra_med/: The main source code directory for the exgra_med package/application.
    • data_preprocessing/: Scripts and modules related to data preprocessing.
    • llava/: Specific modules or components related to llava.
      • eval/: Code for evaluating llava models.
      • instruct/: Code related to instructing llava models.
      • model/: Contains llava model definitions or related utilities.
      • notebook/: Jupyter notebooks for experimentation or demonstration related to llava.
      • serve/: Code for serving llava models (e.g., API endpoints).
      • train/: Training scripts and configurations for llava.
    • untar_files.py: A helper script for extracting (untarring) downloaded archives.
  • LICENSE: The license under which the project is distributed.
  • pyproject.toml: A file used for specifying project build system requirements and project metadata (part of PEP 517/518).
  • README.md: This README file, providing an overview of the project.

📄 Dataset Configuration Files

We provide pre-built .json configuration files for all datasets used in VQA training and evaluation. These files specify paths, splits, and preprocessing parameters necessary for seamless execution.

| Dataset | Task | Description | Config File (Download Link) |
| --- | --- | --- | --- |
| VQA-RAD | VQA | Train/val splits, QA pairs | vqa_rad_config.json |
| SLAKE | VQA | Train/val splits, QA pairs | slake_config.json |
| PATH-VQA | VQA | Train/val splits, QA pairs | pathvqa_config.json |

To download our language-image multimodal instruction-following dataset, please run the following script:

bash scripts/download_data.sh

🔗 Instructions:

  • Place the downloaded .json files under the configs/datasets/ directory.

  • Update the file paths inside these files, if needed, to match your local dataset locations.
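
To confirm a config file is in place and readable before launching training, you can inspect it from Python. This is a minimal sketch; the field names inside the config come from the downloaded .json and are not assumed here:

# Minimal sketch: inspect a dataset config before training
import json
from pathlib import Path

cfg_path = Path("configs/datasets/vqa_rad_config.json")
with cfg_path.open() as f:
    cfg = json.load(f)

# Show the top-level entries (paths, splits, preprocessing parameters, etc.)
print("Top-level entries:", list(cfg))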


🔧 Fine-tuning on VQA Tasks

We provide ready-to-use scripts to fine-tune EXGRA-MED and EXGRA-MED + DCI on three popular medical VQA benchmarks: VQA-RAD, SLAKE, and PATH-VQA.

Each script uses one of our pretrained checkpoints as the starting point. 👉 Before running, make sure to update the --model_name_or_path in each .sh file to point to the correct location of the downloaded model (see the snippet after the script list below).

# Example: Fine-tune on VQA-RAD
bash scripts/llava1-5_stage2_data_rad.sh         # without DCI
bash scripts/llava1-5_stage2_data_rad_dci.sh     # with DCI

# Fine-tune on SLAKE
bash scripts/llava1-5_stage2_slake.sh            # without DCI
bash scripts/llava1-5_stage2_slake_dci.sh        # with DCI

# Fine-tune on PATH-VQA
bash scripts/llava1-5_stage2_pvqa.sh            # without DCI
bash scripts/llava1-5_stage2_pvqa_dci.sh        # with DCI
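
If you are unsure where the checkpoint flag lives in a given script, the sketch below is one illustrative way to locate it; the script name and checkpoint path shown are only examples, and the exact layout of each .sh file may differ:

# Illustrative only: report which checkpoint path a fine-tuning script points at
from pathlib import Path

script = Path("scripts/llava1-5_stage2_data_rad_dci.sh")
for lineno, line in enumerate(script.read_text().splitlines(), start=1):
    if "model_name_or_path" in line:
        print(lineno, line.strip())
# Edit the reported line so it points at your local download,
# e.g. --model_name_or_path ./pretrained/exgra-med-dci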

📈 Evaluation

You can run evaluation for each of the three key tasks:

1. Medical VQA Evaluation

# supports VQA-RAD, SLAKE, PATH-VQA

# change the following:
# --model-name: path to the model checkpoint produced by the fine-tuning stage
# --answers-file: file to store the results (i.e., the answers to the medical questions)
python exgra_med/llava/eval/run_med_datasets_eval_batch.py \
--num-chunks 2 \
--model-name <output_vqa_rad_checkpoint> \
--mm_dense_connector_type none \
--num_l 6 \
--question-file ./data_RAD/test_w_options_new.json \
--image-folder ./data_RAD/images \
--answers-file <answers_file>

# change the following:
# --pred: same as --answers-file above
# The metrics (recall and accuracy) are saved as a text file in the same place, with the same name as --pred.
# e.g., if --pred is ans-opt-new-3.jsonl, the metrics are saved in ans-opt-new-3.txt
python exgra_med/llava/eval/run_eval.py \
--gt ./data_RAD/test_w_options_new.json \
--pred <answers_file> \
--candidate ./data_RAD/candidate.json
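
To take a quick look at the generated predictions, the answers file can be read record by record. This is a minimal sketch assuming the standard JSON Lines layout; the field names inside each record depend on the evaluation script and are not assumed here:

# Minimal sketch: peek at the first few predictions in the answers file
import json

# "ans-opt-new-3.jsonl" is the example answers file name used above
with open("ans-opt-new-3.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record)
        if i == 2:  # show only the first three records
            break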

2. Medical Visual Chatbot

🚧 To be updated!

bash scripts/eval_chatbot.sh

3. Zero-shot Image Classification

By reformulating image classification as visual question answering, we can generate predictions by solving the VQA task with multiple-choice questions (an illustrative prompt template is sketched below).

🚧 To be updated!

bash scripts/eval_zero_shot.sh
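
As an illustration of this reformulation, a label set can be turned into a single multiple-choice question. The template below is purely hypothetical and is not the exact prompt used by our evaluation script:

# Illustrative only: turn a label set into a multiple-choice VQA question
# (this is not the exact prompt used by the evaluation script)
def classification_as_vqa(labels: list[str]) -> str:
    options = " ".join(f"({chr(ord('A') + i)}) {name}" for i, name in enumerate(labels))
    return f"Which of the following best describes this image? {options} Answer with a single option."

print(classification_as_vqa(["pneumonia", "cardiomegaly", "no finding"]))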

🔬 Data Efficiency Demonstration (10% vs. 40%)

To replicate our findings on LLAVA-MED's data inefficiency and the strength of EXGRA-MED with 10% and 40% data (Tables 1 & 2 in the paper):

# Fine-tune EXGRA-MED with 10%/40% data on the VQA task
bash scripts/train_exgra_10percent.sh
# Fine-tune the LLaVA-Med checkpoint with 10%/40% data on the VQA task
bash scripts/train_llava_10percent.sh

Citation

If you find this work useful, please cite our paper:

@article{nguyen2025exgra,
  title={EXGRA-MED: Extended Context Graph Alignment for Medical Vision-Language Models},
  author={Duy M. H. Nguyen and Nghiem T. Diep and Trung Q. Nguyen and Hoang-Bao Le and Tai Nguyen and Tien Nguyen and TrungTin Nguyen and Nhat Ho and Pengtao Xie and Roger Wattenhofer and James Zou and Daniel Sonntag and Mathias Niepert},
  journal={arXiv preprint arXiv:2410.02615},
  year={2025}
}

Usage and License Notices:

The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to the additional restrictions of the respective Terms of Use for LLaMA, Vicuna, and GPT-4. The data is made available under CC BY-NC 4.0. The data, code, and model checkpoints may be used for non-commercial purposes only, and any models trained using the dataset should be used only for research purposes. It is expressly prohibited to use models trained on this data in clinical care or for any clinical decision-making purposes.
