Description
Hi, thanks for your excellent work!
I am reproducing your code and found that the training time is very long.
The training time shown on my server is about 25 days. My script is as follows:
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=1 ./CIRKD-main/train_cirkd_segformer.py \
    --teacher-model segformer \
    --student-model segformer \
    --teacher-backbone MiT_B4 \
    --student-backbone MiT_B0 \
    --dataset citys \
    --data ./CIRKD-main/dataset/cityscapes/ \
    --optimizer-type adamw \
    --pixel-memory-size 2000 \
    --lr 0.0002 \
    --batch-size 4 \
    --workers 16 \
    --crop-size 512 1024 \
    --max-iterations 320000 \
    --save-dir ./CIRKD-main/work_dirs/mit_b0_OnlyKD \
    --log-dir ./CIRKD-main/work_dirs/mit_b0_OnlyKD \
    --teacher-pretrained ./segformers/mit_b4.pth \
    --student-pretrained ./segformers/mit_b0.pth
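For reference, the reported figure implies a time per iteration. A small POSIX shell sketch of that arithmetic, using only the numbers from this post (about 25 days for --max-iterations 320000); the variable names are illustrative:

```shell
#!/bin/sh
# Back-of-the-envelope check of the reported training time.
# Inputs come from the post: about 25 days for --max-iterations 320000.
eta_days=25
max_iters=320000

eta_seconds=$((eta_days * 24 * 3600))
# POSIX shell arithmetic is integer-only, so scale by 100 to keep two decimals.
sec_per_iter_x100=$((eta_seconds * 100 / max_iters))
printf 'implied time per iteration: %d.%02d s\n' \
  "$((sec_per_iter_x100 / 100))" "$((sec_per_iter_x100 % 100))"
```

With these inputs it prints an implied 6.75 s per iteration.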
And the original script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr 127.5.0.4 --master_port 26501 \
    train_cirkd_segformer.py \
    --teacher-model segformer \
    --student-model segformer \
    --teacher-backbone MiT_B4 \
    --student-backbone MiT_B0 \
    --dataset citys \
    --data [your dataset path]/cityscapes/ \
    --batch-size 8 \
    --workers 16 \
    --crop-size 1024 1024 \
    --optimizer-type adamw \
    --pixel-memory-size 2000 \
    --region-contrast-size 1024 \
    --pixel-contrast-size 4096 \
    --kd-temperature 1 \
    --lambda-kd 1.0 \
    --contrast-kd-temperature 1. \
    --lambda-kd 1. \
    --lambda-minibatch-pixel 1. \
    --lambda-memory-pixel 0.1 \
    --lambda-memory-region 0.1 \
    --lr 0.0002 \
    --max-iterations 160000 \
    --save-dir [your directory path to store checkpoint files] \
    --log-dir [your directory path to store log files] \
    --gpu-id 0,1,2,3,4,5,6,7 \
    --teacher-pretrained [your teacher weights path]/segformer_MiT_B4_citys_best_model.pth \
    --student-pretrained [your pretrained-backbone path]/mit_b0.pth
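Besides --batch-size and --max-iterations, the two commands also differ in --crop-size, and several loss-related flags (for example --region-contrast-size, --pixel-contrast-size and the --lambda-* weights) are set explicitly only in the original; whether the defaults in train_cirkd_segformer.py match those values is not shown here. A minimal POSIX shell sketch that lists the differing flags, with the flag lists copied by hand from the two commands and path-valued flags left out:

```shell
#!/bin/sh
# List the flags that differ between the two launch commands.
# Path-valued flags (--data, --save-dir, --log-dir, *-pretrained) are left out on purpose.
# --lambda-kd is passed twice in the original; only the first value is listed.
export LC_ALL=C                       # sort and comm must use the same collation
tmp=$(mktemp -d)

printf '%s\n' \
  '--teacher-model segformer' '--student-model segformer' \
  '--teacher-backbone MiT_B4' '--student-backbone MiT_B0' \
  '--dataset citys' '--optimizer-type adamw' '--pixel-memory-size 2000' \
  '--lr 0.0002' '--batch-size 4' '--workers 16' \
  '--crop-size 512 1024' '--max-iterations 320000' | sort > "$tmp/mine"

printf '%s\n' \
  '--teacher-model segformer' '--student-model segformer' \
  '--teacher-backbone MiT_B4' '--student-backbone MiT_B0' \
  '--dataset citys' '--batch-size 8' '--workers 16' \
  '--crop-size 1024 1024' '--optimizer-type adamw' '--pixel-memory-size 2000' \
  '--region-contrast-size 1024' '--pixel-contrast-size 4096' \
  '--kd-temperature 1' '--lambda-kd 1.0' '--contrast-kd-temperature 1.' \
  '--lambda-minibatch-pixel 1.' '--lambda-memory-pixel 0.1' \
  '--lambda-memory-region 0.1' '--lr 0.0002' '--max-iterations 160000' \
  '--gpu-id 0,1,2,3,4,5,6,7' | sort > "$tmp/orig"

only_mine=$(comm -23 "$tmp/mine" "$tmp/orig")   # lines only in my command
only_orig=$(comm -13 "$tmp/mine" "$tmp/orig")   # lines only in the original
rm -rf "$tmp"

printf 'only in my command:\n%s\n\n' "$only_mine"
printf 'only in the original command:\n%s\n' "$only_orig"
```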
I use 320K training iterations because my batch size is 4.
I only have one RTX 4090 GPU and cannot run with a batch size of 16 as in the original CIRKD_Segformer setting. So I use bs4/320K iterations, which processes the same number of training crops as bs8/160K. Is this the right way to reproduce it?
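The bs4/320K versus bs8/160K comparison can be checked with a little arithmetic, assuming --batch-size is the total batch across all GPUs (as the comparison above implies). The two commands also use different --crop-size values, so matching the number of crops does not match the number of pixels seen:

```shell
#!/bin/sh
# Compare the two schedules by total crops and total pixels seen.
# Assumption: --batch-size is the total batch across all GPUs.
mine_crops=$((4 * 320000))          # --batch-size 4, --max-iterations 320000
orig_crops=$((8 * 160000))          # --batch-size 8, --max-iterations 160000

mine_px_per_crop=$((512 * 1024))    # --crop-size 512 1024
orig_px_per_crop=$((1024 * 1024))   # --crop-size 1024 1024

# Integer ratio of total pixels (original / mine).
px_ratio=$(( (orig_crops * orig_px_per_crop) / (mine_crops * mine_px_per_crop) ))

echo "crops seen:  mine=$mine_crops original=$orig_crops"
echo "pixels seen: original/mine = ${px_ratio}x"
```

Both schedules come to 1,280,000 crops, but with the 512x1024 crop the single-GPU run sees half as many pixels in total.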
Could you please share the training log of train_cirkd_segformer.py?
I would also appreciate your advice on whether any of my parameters are set incorrectly.
Looking forward to your reply. Thanks!
Best Regards!
Jun