This is the repository for the ACL 2024 Findings paper SoMeLVLM: A Large Vision Language Model for Social Media Processing (Xinnong Zhang*, Haoyu Kuang*, et al.).
More resources can be found on the SoMeLVLM HomePage.
🎉🎉🎉[News 2024/05/16] SoMeLVLM has been accepted to ACL 2024 Findings!
Model weights: Lishi0905/SoMeLVLM · Hugging Face
Plain text & Multimodal datasets: Request Form
The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.
We have developed SoMeData, a 654k social media dataset consisting of five cognitive modules and a variety of computational social science (CSS) task categories.
| Module | Category | Data Size (SFT & Eval) | Data Type |
|---|---|---|---|
| Knowledge & Comprehension | Emotion | 45.0k & 5.0k | Text |
| | | 20.3k & 1.5k | Multi |
| | Humor | 18.0k & 8.3k | Text |
| | Figurative Language | 12.5k & 4.6k | Text |
| | Misinformation | 24.4k & 2.0k | Text |
| | | 6.5k & 0.5k | Multi |
| | Hate Speech & Toxicity | 44.1k & 6.3k | Text |
| | | 13.8k & 1.4k | Multi |
| | Ideology & Stance | 24.0k & 3.5k | Text |
| | | 1.6k & 0.3k | Multi |
| | Trustworthiness & Social Bias | 11.0k & 3.2k | Text |
| | Social Factors | 16.2k & 2.5k | Text |
| | | 40.0k & 1.0k | Multi |
| Applying | Emotion | 20.0k & 5.0k | Text |
| | Humor | 15.0k & 6.1k | Text |
| | Hate Speech & Toxicity | 29.6k & 16.2k | Text |
| | Ideology & Stance | 4.3k & 1.0k | Text |
| | Trustworthiness & Social Bias | 30.0k & 0.9k | Text |
| | Social Factors | 50.0k & 1.0k | Multi |
| Analysis | Figurative Language | 30.0k & 2.2k | Text |
| | Emotion | 20.3k & 1.5k | Multi |
| | Hate Speech & Toxicity | 13.8k & 1.5k | Multi |
| | Social Factors | 15.0k & 0.5k | Multi |
| Evaluation | Ideology & Stance | 1.6k & 0.3k | Multi |
| | Misinformation | 2.0k & 0.0k | Text |
| | | 6.5k & 0.5k | Multi |
| | Detoxifying Content | 25.0k & 9.9k | Text |
| | Depolarizing Language | 4.3k & 1.0k | Text |
| Creation | Invert Opinion | 1.0k & 0.0k | Text |
| | Reverse Ideology | 4.3k & 1.0k | Text |
| | Social Factors | 25.0k & 0.5k | Multi |
We conduct both classification and generation tasks in the plain-text and multimodal domains.
Classification tasks (multimodal domain). Each cell reports Acc* / Acc.

| Models | Hate Speech | Misinformation | Social Factors | Emotion | Ideology | Social Factors OOD |
|---|---|---|---|---|---|---|
| Instructblip_V | 41.62 / 33.43 | 47.55 / 13.60 | 80.02 / 40.93 | 54.53 / 48.90 | 54.15 / 42.41 | 87.30 / 22.59 |
| Instructblip_F | 50.40 / 48.43 | 80.78 / 79.00 | 81.33 / 73.57 | 58.90 / 57.80 | 53.69 / 45.57 | 98.31 / 83.95 |
| Blip2 | 52.14 / 52.14 | 80.60 / 80.60 | 81.83 / 80.89 | 57.73 / 57.73 | 53.48 / 53.48 | 99.15 / 95.69 |
| Llava | 53.35 / 9.79 | 84.67 / 25.40 | 72.49 / 6.69 | 53.39 / 10.10 | 49.79 / 1.58 | 93.75 / 3.08 |
| MiniGPT4 | 45.12 / 23.00 | 65.30 / 54.20 | 64.08 / 36.18 | 53.13 / 29.48 | 42.13 / 8.86 | 69.58 / 34.29 |
| SoMeLVLM | 72.57 / 72.57 | 82.60 / 82.60 | 84.07 / 67.33 | 63.50 / 63.47 | 73.24 / 55.06 | 100.00 / 61.11 |
Generation tasks (multimodal domain)

| Models | Metric | Hate Speech | Misinformation | Social Factors | Emotion | Ideology | Social Factors OOD |
|---|---|---|---|---|---|---|---|
| Instructblip_V | BLEU | 0.65 | 1.09 | 6.21 | 0.85 | 0.60 | 1.14 |
| | ROUGE | 3.13 | 0.88 | 9.02 | 7.26 | 4.89 | 14.03 |
| | GPT | 1.83 | 2.84 | 1.46 | 1.96 | 1.61 | 2.07 |
| Instructblip_F | BLEU | 0.24 | 0.05 | 1.16 | 0.28 | 0.78 | 1.51 |
| | ROUGE | 2.79 | 0.81 | 14.60 | 13.69 | 8.36 | 16.91 |
| | GPT | 2.11 | 2.85 | 2.12 | 3.02 | 1.62 | 2.16 |
| Blip2 | BLEU | 0.62 | 0.02 | 0.76 | 0.16 | 0.25 | 0.65 |
| | ROUGE | 2.25 | 1.89 | 11.99 | 14.82 | 4.35 | 12.87 |
| | GPT | 1.86 | 2.72 | 1.89 | 3.08 | 2.34 | 1.61 |
| Llava | BLEU | 0.36 | 0.00 | 1.89 | 0.64 | 1.10 | 2.29 |
| | ROUGE | 4.52 | 0.01 | 12.80 | 5.74 | 8.73 | 20.10 |
| | GPT | 1.23 | 0.81 | 1.80 | 1.25 | 1.21 | 2.27 |
| Minigpt4 | BLEU | 0.43 | 0.69 | 1.20 | 0.55 | 0.32 | 1.98 |
| | ROUGE | 8.84 | 12.15 | 17.20 | 10.81 | 12.68 | 20.73 |
| | GPT | 2.28 | 2.18 | 1.59 | 2.37 | 1.28 | 1.84 |
| SoMeLVLM | BLEU | 31.04 | 24.06 | 14.49 | 37.65 | 24.08 | 10.18 |
| | ROUGE | 46.35 | 43.22 | 32.87 | 53.87 | 41.04 | 31.03 |
| | GPT | 3.21 | 2.94 | 2.86 | 3.53 | 3.39 | 3.45 |
Classification tasks (plain-text domain)

| Models | Emotion | Humor | Figurative Language | Misinformation | Hate Speech | Ideology | Trustworthiness | Social Factors |
|---|---|---|---|---|---|---|---|---|
| Vicuna-7b-v1.1 | 35.86 | 41.08 | 47.07 | 59.23 | 11.94 | 34.15 | 36.60 | 42.68 |
| Llama2-7b-chat | 40.54 | 61.31 | 53.77 | 41.11 | 12.84 | 37.77 | 59.21 | 31.61 |
| ChatGLM2 | 41.20 | 36.94 | 52.05 | 47.21 | 14.67 | 30.07 | 68.44 | 48.23 |
| SoMeLVLM | 80.66 | 60.47 | 61.70 | 70.38 | 22.20 | 45.23 | 43.52 | 55.39 |
Generation tasks (plain-text domain)

| Models | Metric | Emotion | Humor | Figurative Language | Offensiveness | Ideology | Trustworthiness | Detoxifying Content | Depolarizing Language | Reverse Ideology |
|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7b-v1.1 | BLEU | 7.97 | 10.49 | 8.03 | 7.01 | 9.36 | 9.70 | 10.43 | 22.31 | 33.40 |
| | ROUGE | 31.31 | 36.21 | 31.55 | 31.24 | 32.78 | 34.13 | 27.96 | 42.72 | 51.76 |
| | GPT | 3.23 | 3.24 | 2.57 | 3.63 | 3.41 | 3.13 | 2.50 | 3.26 | 2.98 |
| Llama2-7b-chat | BLEU | 4.25 | 6.36 | 10.39 | 1.79 | 4.75 | 4.73 | 1.31 | 8.40 | 20.54 |
| | ROUGE | 23.50 | 28.37 | 31.32 | 17.41 | 25.01 | 26.54 | 10.94 | 26.72 | 38.06 |
| | GPT | 2.99 | 2.48 | 2.73 | 1.94 | 2.78 | 2.82 | 1.14 | 2.21 | 2.04 |
| ChatGLM2 | BLEU | 6.60 | 8.98 | 7.20 | 4.50 | 6.59 | 9.25 | 6.84 | 13.33 | 21.91 |
| | ROUGE | 29.47 | 34.49 | 29.07 | 28.05 | 29.94 | 34.35 | 23.92 | 35.66 | 42.27 |
| | GPT | 3.05 | 2.37 | 2.06 | 2.93 | 2.86 | 2.73 | 2.00 | 2.80 | 2.80 |
| SoMeLVLM | BLEU | 26.96 | 13.81 | 23.77 | 17.24 | 14.60 | 12.37 | 27.13 | 23.54 | 44.09 |
| | ROUGE | 51.88 | 42.84 | 45.42 | 43.10 | 39.49 | 39.06 | 47.76 | 45.47 | 61.96 |
| | GPT | 3.63 | 3.38 | 3.02 | 3.64 | 3.43 | 3.59 | 2.89 | 3.28 | 3.41 |
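For reference, here is a minimal sketch of how per-response BLEU and ROUGE scores of the kind reported above can be computed. The exact metric variants and evaluation scripts used in the paper are not specified here, so the choices below (sentence-level BLEU with smoothing and ROUGE-L F1, via the `nltk` and `rouge-score` packages) are assumptions for illustration only.

```python
# Sketch only: BLEU / ROUGE for a single generated response.
# Assumptions: sentence-level BLEU with smoothing and ROUGE-L F1;
# the two strings below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "gold response written by annotators"
candidate = "response generated by the model"

# BLEU on whitespace tokens, smoothed so short outputs do not collapse to zero
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1 between the reference and the candidate
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu * 100:.2f}  ROUGE-L: {rouge_l * 100:.2f}")
```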
We further provide a comprehensive analysis according to the five cognitive abilities.
1. The overall project is based on LAVIS by Salesforce. To reproduce SoMeLVLM, prepare the LAVIS environment first:

   ```bash
   conda create -n SoMeLVLM python=3.8
   conda activate SoMeLVLM
   git clone https://github.com/salesforce/LAVIS.git
   cd LAVIS
   pip install -e .
   ```

   Note that we will modify the model config for inference, so we recommend installing LAVIS from source via git (editable install) as above.

   Steps 2 & 3 below follow the "Adding Models" section of the LAVIS documentation.
2. Add `SoMeLVLM.yaml` to the LAVIS model configs in the `./LAVIS/lavis/configs/models/blip2/` directory (a sketch is given below).
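   The following is only a minimal sketch of what `SoMeLVLM.yaml` could contain, not the released config: the simplest assumption is to copy the existing `blip2_instruct_vicuna7b.yaml` from the same directory and redirect its language-model path to the SoMeLVLM weights prepared in step 4.

   ```yaml
   # Sketch only -- assumed to mirror blip2_instruct_vicuna7b.yaml.
   # Keep all other fields from the copied vicuna7b config unchanged;
   # only the language-model path points at the weights from step 4.
   model:
     arch: blip2_vicuna_instruct
     llm_model: "./llm/SoMeLVLM/"
   ```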
3. Register SoMeLVLM in the original `blip2_vicuna_instruct` model at `./LAVIS/lavis/models/blip2_models/blip2_vicuna_instruct.py`, line 29:

   ```python
   PRETRAINED_MODEL_CONFIG_DICT = {
       "vicuna7b": "configs/models/blip2/blip2_instruct_vicuna7b.yaml",
       "vicuna13b": "configs/models/blip2/blip2_instruct_vicuna13b.yaml",
       "SoMeLVLM": "configs/models/blip2/SoMeLVLM.yaml",
   }
   ```
4. Prepare the model weights from Hugging Face (`Lishi0905/SoMeLVLM`); one way to fetch them is sketched below:
   - `checkpoint.pth` is the connection-module checkpoint;
   - the remaining files are the base language model and should be placed under the `./llm/SoMeLVLM/` directory.
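   A minimal download sketch using the `huggingface_hub` Python package (the file layout is taken from the description above):

   ```python
   # Sketch only: download the SoMeLVLM weights from the Hugging Face Hub.
   from huggingface_hub import snapshot_download

   local_dir = snapshot_download(repo_id="Lishi0905/SoMeLVLM")
   print("Weights downloaded to:", local_dir)
   # - checkpoint.pth    -> connection module, loaded later via model.load_checkpoint()
   # - other model files -> base language model, place them under ./llm/SoMeLVLM/
   ```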
5. Load the SoMeLVLM model:

   ```python
   import torch
   from PIL import Image
   from lavis.models import load_model_and_preprocess

   device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
   model, vis_processors, _ = load_model_and_preprocess(
       name="blip2_vicuna_instruct", model_type="SoMeLVLM", is_eval=True, device=device
   )

   # load the connection-module checkpoint (checkpoint.pth from step 4)
   checkpoint_path = "your/path/to/checkpoint.pth"
   model.load_checkpoint(checkpoint_path)
   ```
6. Start inference:

   ```python
   raw_image = Image.open("your/img/path").convert("RGB")
   image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
   prompt = "your prompt here."
   # generate() returns a list of decoded responses; take the first one
   answer = model.generate({"image": image, "prompt": prompt})[0]
   ```
The data used in this paper come from real users on diverse social media platforms, so privacy is treated with caution. The data drawn from open-source datasets are safe, as sensitive information has already been masked. For the data we collect ourselves, we strictly follow the privacy policies of the social media platforms and carefully remove personal information before releasing our instruction dataset.
If you find SoMeLVLM or our datasets useful, please consider citing our paper:
@inproceedings{zhang-etal-2024-somelvlm,
title = "{S}o{M}e{LVLM}: A Large Vision Language Model for Social Media Processing",
author = "Zhang, Xinnong and
Kuang, Haoyu and
Mou, Xinyi and
Lyu, Hanjia and
Wu, Kun and
Chen, Siming and
Luo, Jiebo and
Huang, Xuanjing and
Wei, Zhongyu",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.140",
doi = "10.18653/v1/2024.findings-acl.140",
pages = "2366--2389",
}