Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification
Accepted by AAAI 2025
[Paper] | [Recent News] | [Running the Code] | [Acknowledgments] | [License and Terms of Use] | [Citation]
Abstract: Whole Slide Image (WSI) classification has significant applications in clinical pathology, e.g., tumor identification and cancer diagnosis. Currently, most research attention is focused on Multiple Instance Learning (MIL) using static datasets. An obvious weakness of these methods is that they cannot efficiently preserve and utilize previously learned knowledge: whenever new data arrive, classification models must be re-trained on both the previous and the new data. To overcome this shortcoming and move beyond the traditional vision-only modality, this paper proposes the first Vision-Language-based framework with Queryable Prototype Multiple Instance Learning (QPMIL-VL), specially designed for incremental WSI classification. The framework mainly consists of two information-processing branches: one generates bag-level features by prototype-guided aggregation of instance features, while the other enhances class features through a combination of class ensemble, a tunable vector and a class similarity loss. Experiments on four public WSI datasets demonstrate that our QPMIL-VL framework is effective for incremental WSI classification and often significantly outperforms the compared methods, achieving state-of-the-art (SOTA) performance.
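For intuition only, below is a purely illustrative PyTorch sketch of this two-branch idea (prototype-guided instance aggregation, then cosine matching against class features). It is not the released implementation; the prototype pool size, feature dimension, mean fusion and temperature value are all placeholder assumptions.

```python
# Purely illustrative sketch of the two-branch idea (NOT the released implementation).
import torch
import torch.nn.functional as F

def bag_from_prototypes(instances: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Prototype-guided aggregation: instances (N, D) -> bag-level feature (D,)."""
    sim = F.normalize(instances, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (N, P)
    weights = sim.softmax(dim=0)            # per-prototype attention over instances
    proto_feats = weights.T @ instances     # (P, D): each prototype pools the bag differently
    return proto_feats.mean(dim=0)          # (D,): simple fusion of the prototype views

def classify(bag_feat: torch.Tensor, class_feats: torch.Tensor, temperature: float = 0.07):
    """Match the bag-level feature against (text-derived) class features by cosine similarity."""
    logits = F.normalize(bag_feat, dim=-1) @ F.normalize(class_feats, dim=-1).T  # (C,)
    return (logits / temperature).softmax(dim=-1)

# Toy usage with random tensors (placeholder shapes).
instances = torch.randn(500, 512)   # 500 patch features of one WSI
prototypes = torch.randn(8, 512)    # small queryable prototype pool
class_feats = torch.randn(4, 512)   # 4 classes, e.g. text features plus a tunable vector
print(classify(bag_from_prototypes(instances, prototypes), class_feats))
```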
- 04/11/2025: Our paper has been published in the AAAI 2025 proceedings.
- 01/23/2025: 🎉 One co-authored paper, VLSA, has been accepted by ICLR 2025. Refer to its code & paper for more details. Congratulations to Pei Liu.
- 12/15/2024: Updated our prepared dataset to enable direct online viewing of its file organization and on-demand downloading of specific files.
- 12/10/2024: 🥳 Our QPMIL-VL has been accepted by AAAI 2025, and its code & paper (including Supplementary Material) are now available.
This repository is still being updated. Stay tuned.
First, you can download the pre-trained weights of the pathology VLM CONCH here (official link).
We use CLAM to crop non-overlapping 256 × 256 patches from the segmented tissue at 10× magnification. Then, the pre-trained image encoder in CONCH is used to extract instance features. You can refer to Pipeline-Processing-TCGA-Slides-for-MIL for a detailed tutorial.
Of course, you can also use the dataset we prepared directly (the corresponding compressed file is here).
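If you run the feature extraction yourself, the following is a minimal sketch of embedding the patches of one slide with the CONCH image encoder. It follows the usage shown in the official CONCH repo, but the checkpoint path, patch folder, file pattern and output name are placeholder assumptions; the tutorial linked above remains the authoritative reference.

```python
# Minimal sketch (not the official extraction script), assuming CLAM has already
# saved the cropped 256x256 patches of one slide as PNG files under patch_dir.
import glob
import torch
from PIL import Image
from conch.open_clip_custom import create_model_from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# "path/to/conch.pt" is a placeholder for the downloaded CONCH checkpoint.
model, preprocess = create_model_from_pretrained("conch_ViT-B-16", "path/to/conch.pt")
model = model.eval().to(device)

features = []
with torch.inference_mode():
    for patch_path in sorted(glob.glob("patch_dir/*.png")):  # hypothetical patch folder
        image = preprocess(Image.open(patch_path).convert("RGB")).unsqueeze(0).to(device)
        # Kwargs mirror the CONCH README example; follow the linked tutorial for the
        # exact settings used to build the MIL instance features.
        features.append(model.encode_image(image, proj_contrast=False, normalize=False).cpu())

bag = torch.cat(features, dim=0)          # (num_instances, feature_dim)
torch.save(bag, "slide_features.pt")      # one feature bag per WSI
```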
All experiments are run on a machine with
- two NVIDIA GeForce RTX 3090 GPUs
- python==3.8 and pytorch==1.11.0+cu113
Detailed package requirements:
- For `pip` or `conda` users, the full requirements are provided in requirements.txt.
- For `docker` users, you can use our base docker image via `docker pull yuukilp/deepath:py38-torch1.11.0-cuda11.3-cudnn8-devel` and then install the additional essential Python packages (see requirements.txt) in the container.
All important arguments are explained in `configs/main.yaml`. You can replace the values of `dataset_root_dir` and `conch_ckpt_path` with the root directory of the dataset and the path of the CONCH pre-trained weights, respectively.

Finally, in the `scripts/` directory of the project, execute the following command (ten-fold cross-validation):

`./main.sh`
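As an optional sanity check before launching the script (not part of this repo), you can quickly confirm that the two paths are set correctly. The snippet below assumes PyYAML is installed and that `dataset_root_dir` and `conch_ckpt_path` are top-level keys in `configs/main.yaml`; adjust it if the config nests them differently.

```python
# Hypothetical pre-flight check: confirm the two paths in configs/main.yaml exist.
import os
import yaml  # PyYAML, assumed to be available in the environment

with open("configs/main.yaml", "r") as f:
    cfg = yaml.safe_load(f)

for key in ("dataset_root_dir", "conch_ckpt_path"):
    path = cfg.get(key)
    status = "found" if path and os.path.exists(path) else "MISSING"
    print(f"{key}: {path} -> {status}")
```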
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62476049.
Some parts of the code in this repo are adapted from the following amazing works. We thank the authors and developers for their selfless contributions.
- Luping Ji, Pei Liu: Provided detailed guidance.
- CONCH: Our QPMIL-VL is driven by this great pathology VLM.
- AttriCLIP: A CLIP-based prompt tuning approach for incremental learning on natural images.
- L2P: Designs a prompt-query mechanism on top of a pre-trained ViT model to mitigate catastrophic forgetting in incremental learning for natural images.
- TOP: Proposes a two-level prompt learning MIL framework based on GPT-4 and CLIP for the Few-shot WSI Classification (FSWC) problem.
- TaskRes: Proposes an efficient fine-tuning method for VL pre-trained models.
ⓒ UESTC. The models and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of the QPMIL-VL model and its derivatives is prohibited and requires prior approval. If you are a commercial entity, please contact the corresponding author (Luping Ji).
If you find this work helpful for your research, please consider citing our paper:
@inproceedings{gou2025queryable,
title={Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification},
author={Gou, Jiaxiang and Ji, Luping and Liu, Pei and Ye, Mao},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={3},
pages={3158--3166},
year={2025}
}
Additionally, another work of ours proposes, for the first time, a new Vision-Language-based Survival Analysis (VLSA) paradigm. If you find VLSA useful, please also consider citing the corresponding paper:
@inproceedings{liu2025interpretable,
title={Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology},
author={Liu, Pei and Ji, Luping and Gou, Jiaxiang and Fu, Bo and Ye, Mao},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://arxiv.org/abs/2409.09369}
}