[ICASSP 2025] Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention
Authors: Yuzhe Weng, Haotian Wang, Tian Gao, Kewei Li, Shutong Niu, Jun Du
- 🔥 (Mar 24, 2025) The representations used for training have been open-sourced!
- (Oct 19, 2024) The final model weights have been shared on Google Drive!
- The project page has been uploaded to GitHub!
The overall architecture:
In multimodal sentiment analysis, collecting text data is often more challenging than collecting video or audio. To address this challenge, we developed a robust model that effectively integrates multimodal sentiment information even when the text modality is absent. Specifically, we designed a Double-Flow Self-Distillation Framework, comprising Unified Modality Cross-Attention (UMCA) and a Modality Imagination Autoencoder (MIA), which handles both complete-modality scenarios and those with a missing text modality. When the text modality is missing, the framework uses an LLM-based model to simulate the text representation from the audio modality. To further align the simulated and real representations, we also introduce the Rank-N Contrast (RNC) loss. When tested on CMU-MOSEI, our model achieves outstanding MAE performance and significantly outperforms other models when the text modality is missing.
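For intuition, here is a rough, simplified PyTorch sketch of the Rank-N Contrast idea (after Zha et al.), not the exact implementation used in this repository: for an anchor and a positive, every sample whose label distance to the anchor is at least as large acts as a negative, so representations order themselves by sentiment distance. The cosine similarity, temperature, and per-pair loop are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rank_n_contrast(features, labels, temperature=2.0):
    """Simplified Rank-N Contrast (RNC) sketch over a batch of (feature, label) pairs."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature                   # (B, B) similarities
    label_dist = (labels.view(-1, 1) - labels.view(1, -1)).abs()  # (B, B) label gaps

    batch_size = features.size(0)
    not_self = ~torch.eye(batch_size, dtype=torch.bool, device=features.device)

    loss, num_pairs = features.new_zeros(()), 0
    for i in range(batch_size):
        for j in range(batch_size):
            if i == j:
                continue
            # candidate set: samples at least as far from i (in label space) as j is
            mask = (label_dist[i] >= label_dist[i, j]) & not_self[i]
            log_denominator = torch.logsumexp(sim[i][mask], dim=0)
            loss = loss + (log_denominator - sim[i, j])
            num_pairs += 1
    return loss / max(num_pairs, 1)
```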
| Model | Complete Modality MAE | Text Modality Missing MAE | File Size | Link |
|---|---|---|---|---|
| SDUMC | 0.5060 | 0.5503 | 49 MB | [Google Drive] |
Representations: [Baidu Drive](https://pan.baidu.com/s/1iHbWPZps-uidqRflAnKnFw?pwd=cqdb) ->
- features_mosei/manet_FRA
- features_mosei/vicuna-7b-v1.5-FRA-wavlm2vicuna-half-gt
- features_mosei/vicuna-7b-v1.5-FRA-wavlm2vicuna-half-wav+prompt[take_generate_wordembed_-4]
- features_mosei/wavlm-large-FRA_-5
Dataset Labels [Google Drive]
label_official.npz -> dataset/datasets_label/cmumosei-process/
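To sanity-check the downloaded label file, a minimal NumPy sketch is shown below; the array names inside label_official.npz are not documented here, so it simply lists whatever the archive contains.

```python
import numpy as np

# Inspect the downloaded CMU-MOSEI label archive; no specific key names are assumed.
labels = np.load("dataset/datasets_label/cmumosei-process/label_official.npz",
                 allow_pickle=True)
for name in labels.files:
    arr = labels[name]
    print(name, getattr(arr, "shape", type(arr)))
```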
Python >= 3.9
PyTorch >= 1.8.0
pip install -r requirements.txt
If you wish to run inference to evaluate the model's performance, please download the model weights and modality representations into their respective directories. The directory structure should be as follows:
├── checkpoints
│ └── mosei_mult-view_kd_full_0.5060_0.5503.pt
└── dataset
├── datasets_label
│ └── cmumosei-process
│ └── label_official.npz
└── features_mosei
├── manet_FRA
├── vicuna-7b-v1.5-FRA-wavlm2vicuna-half-gt
├── vicuna-7b-v1.5-FRA-wavlm2vicuna-half-wav+prompt[take_generate_wordembed_-4]
└── wavlm-large-FRA_-5
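Before launching the inference script, you can optionally confirm everything is in place; the paths below simply mirror the directory tree above.

```python
from pathlib import Path

# Optional sanity check: confirm the checkpoint, labels, and feature folders
# from the directory tree are present before running inference.
expected = [
    "checkpoints/mosei_mult-view_kd_full_0.5060_0.5503.pt",
    "dataset/datasets_label/cmumosei-process/label_official.npz",
    "dataset/features_mosei/manet_FRA",
    "dataset/features_mosei/vicuna-7b-v1.5-FRA-wavlm2vicuna-half-gt",
    "dataset/features_mosei/vicuna-7b-v1.5-FRA-wavlm2vicuna-half-wav+prompt[take_generate_wordembed_-4]",
    "dataset/features_mosei/wavlm-large-FRA_-5",
]
missing = [p for p in expected if not Path(p).exists()]
print("All paths present." if not missing else f"Missing: {missing}")
```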
Run the script to view inference results:
bash ./shell/main_text_missing_icassp_inference.sh
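If you only want to peek at the released weights without running the full pipeline, the checkpoint can be loaded directly with PyTorch; the layout of the saved object below is an assumption, so adjust the key access if it differs.

```python
import torch

# Load the released checkpoint on CPU and list a few parameter names.
ckpt = torch.load("checkpoints/mosei_mult-view_kd_full_0.5060_0.5503.pt",
                  map_location="cpu")
# Some checkpoints wrap the weights, e.g. {"state_dict": ...}; fall back otherwise.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
if isinstance(state, dict):
    print(f"{len(state)} tensors; first keys: {list(state)[:5]}")
```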
If you want to train the model using the representations we have already extracted, download them from the links above and run:
bash ./shell/main_text_missing_icassp.sh
If you want to extract representations yourself for related experiments, you can refer to the following configuration.
Build the ./tools folder:
## for face extractor (OpenFace-win)
https://drive.google.com/file/d/1-O8epcTDYCrRUU_mtXgjrS3OWA4HTp0-/view?usp=share_link -> tools/openface_win_x64
## for visual feature extraction
https://drive.google.com/file/d/1wT2h5sz22SaEL4YTBwTIB3WoL4HUvg5B/view?usp=share_link -> tools/manet
## for audio extraction
https://www.johnvansickle.com/ffmpeg/old-releases -> tools/ffmpeg-4.4.1-i686-static
## for acoustic features
https://huggingface.co/microsoft/wavlm-large -> tools/transformers/wavlm-large
## for text features
https://huggingface.co/lmsys/vicuna-7b-v1.5 -> tools/transformers/vicuna-7b-v1.5
## for simulated text representation
# details: https://github.com/X-LANCE/SLAM-LLM/blob/main/examples/asr_librispeech/README.md
https://drive.google.com/file/d/1cLNuMR05oXxKj8M_Z3yAZ5JHJ06ybIHp/view?usp=sharing -> tools/transformers/WalmL2VicunaV1.5_model.pt
You can refer to the run.sh file in each directory under ./features_extraction to extract each representation.
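As an illustration (not the repository's run.sh), frame-level acoustic features can be extracted with Hugging Face WavLM roughly as follows. Taking the 5th-from-last hidden layer is an assumption based on the wavlm-large-FRA_-5 directory name, and example.wav is a placeholder for a 16 kHz mono clip.

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load the locally downloaded WavLM checkpoint (see the tools/ mapping above).
extractor = AutoFeatureExtractor.from_pretrained("tools/transformers/wavlm-large")
model = WavLMModel.from_pretrained("tools/transformers/wavlm-large").eval()

wav, sr = sf.read("example.wav")  # placeholder path; WavLM expects 16 kHz mono audio
inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
features = hidden_states[-5].squeeze(0)  # assumed layer choice: (num_frames, 1024)
print(features.shape)
```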
Our ablation study evaluates the model's design choices and demonstrates the effectiveness of each.
Compared with recent models that perform well on this task, ours achieves the best performance in both the complete-modality and missing-text-modality settings.
Thanks to the open-source repository MERTools, on which much of this work is based.
If you find our work useful in your research, please consider citing:
@inproceedings{weng2025enhancing,
title={Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention},
author={Weng, Yuzhe and Wang, Haotian and Gao, Tian and Li, Kewei and Niu, Shutong and Du, Jun},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2025},
organization={IEEE}
}