This is the repository for the ACL 2024 Findings paper SoMeLVLM: A Large Vision Language Model for Social Media Processing (Xinnong Zhang*, Haoyu Kuang*, et al.).
More resources can be found on the SoMeLVLM HomePage.
🎉🎉🎉[News 2024/05/16] SoMeLVLM has been accepted to ACL 2024 Findings!
Model weights: Lishi0905/SoMeLVLM · Hugging Face
Plain text & Multimodal datasets: Request Form
The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.
We have developed SoMeData, a 654k social media dataset consisting of five cognitive modules and a variety of computational social science (CSS) task categories.
| Module | Category | Data Size (SFT & Eval) | Data Type |
|---|---|---|---|
| Knowledge & Comprehension | Emotion | 45.0k & 5.0k | Text |
| | | 20.3k & 1.5k | Multi |
| | Humor | 18.0k & 8.3k | Text |
| | Figurative Language | 12.5k & 4.6k | Text |
| | Misinformation | 24.4k & 2.0k | Text |
| | | 6.5k & 0.5k | Multi |
| | Hate Speech & Toxicity | 44.1k & 6.3k | Text |
| | | 13.8k & 1.4k | Multi |
| | Ideology & Stance | 24.0k & 3.5k | Text |
| | | 1.6k & 0.3k | Multi |
| | Trustworthiness & Social Bias | 11.0k & 3.2k | Text |
| | Social Factors | 16.2k & 2.5k | Text |
| | | 40.0k & 1.0k | Multi |
| Applying | Emotion | 20.0k & 5.0k | Text |
| | Humor | 15.0k & 6.1k | Text |
| | Hate Speech & Toxicity | 29.6k & 16.2k | Text |
| | Ideology & Stance | 4.3k & 1.0k | Text |
| | Trustworthiness & Social Bias | 30.0k & 0.9k | Text |
| | Social Factors | 50.0k & 1.0k | Multi |
| Analysis | Figurative Language | 30.0k & 2.2k | Text |
| | Emotion | 20.3k & 1.5k | Multi |
| | Hate Speech & Toxicity | 13.8k & 1.5k | Multi |
| | Social Factors | 15.0k & 0.5k | Multi |
| Evaluation | Ideology & Stance | 1.6k & 0.3k | Multi |
| | Misinformation | 2.0k & 0.0k | Text |
| | | 6.5k & 0.5k | Multi |
| | Detoxifying Content | 25.0k & 9.9k | Text |
| | Depolarizing Language | 4.3k & 1.0k | Text |
| Creation | Invert Opinion | 1.0k & 0.0k | Text |
| | Reverse Ideology | 4.3k & 1.0k | Text |
| | Social Factors | 25.0k & 0.5k | Multi |
We conduct both classification and generation tasks in the plain-text and multimodal domains.
Classification tasks (multimodal domain). Each cell reports Acc* / Acc.

| Models | Hate Speech | Misinformation | Social Factors | Emotion | Ideology | Social Factors OOD |
|---|---|---|---|---|---|---|
| Instructblip_V | 41.62 / 33.43 | 47.55 / 13.60 | 80.02 / 40.93 | 54.53 / 48.90 | 54.15 / 42.41 | 87.30 / 22.59 |
| Instructblip_F | 50.40 / 48.43 | 80.78 / 79.00 | 81.33 / 73.57 | 58.90 / 57.80 | 53.69 / 45.57 | 98.31 / 83.95 |
| Blip2 | 52.14 / 52.14 | 80.60 / 80.60 | 81.83 / 80.89 | 57.73 / 57.73 | 53.48 / 53.48 | 99.15 / 95.69 |
| Llava | 53.35 / 9.79 | 84.67 / 25.40 | 72.49 / 6.69 | 53.39 / 10.10 | 49.79 / 1.58 | 93.75 / 3.08 |
| MiniGPT4 | 45.12 / 23.00 | 65.30 / 54.20 | 64.08 / 36.18 | 53.13 / 29.48 | 42.13 / 8.86 | 69.58 / 34.29 |
| SoMeLVLM | 72.57 / 72.57 | 82.60 / 82.60 | 84.07 / 67.33 | 63.50 / 63.47 | 73.24 / 55.06 | 100.00 / 61.11 |
Generation tasks (multimodal domain)

| Models | Metric | Hate Speech | Misinformation | Social Factors | Emotion | Ideology | Social Factors OOD |
|---|---|---|---|---|---|---|---|
| Instructblip_V | BLEU | 0.65 | 1.09 | 6.21 | 0.85 | 0.60 | 1.14 |
| | ROUGE | 3.13 | 0.88 | 9.02 | 7.26 | 4.89 | 14.03 |
| | GPT | 1.83 | 2.84 | 1.46 | 1.96 | 1.61 | 2.07 |
| Instructblip_F | BLEU | 0.24 | 0.05 | 1.16 | 0.28 | 0.78 | 1.51 |
| | ROUGE | 2.79 | 0.81 | 14.60 | 13.69 | 8.36 | 16.91 |
| | GPT | 2.11 | 2.85 | 2.12 | 3.02 | 1.62 | 2.16 |
| Blip2 | BLEU | 0.62 | 0.02 | 0.76 | 0.16 | 0.25 | 0.65 |
| | ROUGE | 2.25 | 1.89 | 11.99 | 14.82 | 4.35 | 12.87 |
| | GPT | 1.86 | 2.72 | 1.89 | 3.08 | 2.34 | 1.61 |
| Llava | BLEU | 0.36 | 0.00 | 1.89 | 0.64 | 1.10 | 2.29 |
| | ROUGE | 4.52 | 0.01 | 12.80 | 5.74 | 8.73 | 20.10 |
| | GPT | 1.23 | 0.81 | 1.80 | 1.25 | 1.21 | 2.27 |
| Minigpt4 | BLEU | 0.43 | 0.69 | 1.20 | 0.55 | 0.32 | 1.98 |
| | ROUGE | 8.84 | 12.15 | 17.20 | 10.81 | 12.68 | 20.73 |
| | GPT | 2.28 | 2.18 | 1.59 | 2.37 | 1.28 | 1.84 |
| SoMeLVLM | BLEU | 31.04 | 24.06 | 14.49 | 37.65 | 24.08 | 10.18 |
| | ROUGE | 46.35 | 43.22 | 32.87 | 53.87 | 41.04 | 31.03 |
| | GPT | 3.21 | 2.94 | 2.86 | 3.53 | 3.39 | 3.45 |
Classification tasks (plain-text domain)

| Models | Emotion | Humor | Figurative Language | Misinformation | Hate Speech | Ideology | Trustworthiness | Social Factors |
|---|---|---|---|---|---|---|---|---|
| Vicuna-7b-v1.1 | 35.86 | 41.08 | 47.07 | 59.23 | 11.94 | 34.15 | 36.60 | 42.68 |
| Llama2-7b-chat | 40.54 | 61.31 | 53.77 | 41.11 | 12.84 | 37.77 | 59.21 | 31.61 |
| ChatGLM2 | 41.20 | 36.94 | 52.05 | 47.21 | 14.67 | 30.07 | 68.44 | 48.23 |
| SoMeLVLM | 80.66 | 60.47 | 61.70 | 70.38 | 22.20 | 45.23 | 43.52 | 55.39 |
Generation tasks (plain-text domain)

| Models | Metric | Emotion | Humor | Figurative Language | Offensiveness | Ideology | Trustworthiness | Detoxifying Content | Depolarizing Language | Reverse Ideology |
|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7b-v1.1 | BLEU | 7.97 | 10.49 | 8.03 | 7.01 | 9.36 | 9.70 | 10.43 | 22.31 | 33.40 |
| | ROUGE | 31.31 | 36.21 | 31.55 | 31.24 | 32.78 | 34.13 | 27.96 | 42.72 | 51.76 |
| | GPT | 3.23 | 3.24 | 2.57 | 3.63 | 3.41 | 3.13 | 2.50 | 3.26 | 2.98 |
| Llama2-7b-chat | BLEU | 4.25 | 6.36 | 10.39 | 1.79 | 4.75 | 4.73 | 1.31 | 8.40 | 20.54 |
| | ROUGE | 23.50 | 28.37 | 31.32 | 17.41 | 25.01 | 26.54 | 10.94 | 26.72 | 38.06 |
| | GPT | 2.99 | 2.48 | 2.73 | 1.94 | 2.78 | 2.82 | 1.14 | 2.21 | 2.04 |
| ChatGLM2 | BLEU | 6.60 | 8.98 | 7.20 | 4.50 | 6.59 | 9.25 | 6.84 | 13.33 | 21.91 |
| | ROUGE | 29.47 | 34.49 | 29.07 | 28.05 | 29.94 | 34.35 | 23.92 | 35.66 | 42.27 |
| | GPT | 3.05 | 2.37 | 2.06 | 2.93 | 2.86 | 2.73 | 2.00 | 2.80 | 2.80 |
| SoMeLVLM | BLEU | 26.96 | 13.81 | 23.77 | 17.24 | 14.60 | 12.37 | 27.13 | 23.54 | 44.09 |
| | ROUGE | 51.88 | 42.84 | 45.42 | 43.10 | 39.49 | 39.06 | 47.76 | 45.47 | 61.96 |
| | GPT | 3.63 | 3.38 | 3.02 | 3.64 | 3.43 | 3.59 | 2.89 | 3.28 | 3.41 |
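For reference, here is a minimal sketch of how per-response BLEU and ROUGE scores of the kind reported above can be computed. The exact metric variants and evaluation scripts used in the paper are not specified here, so the choices below (sentence-level BLEU with smoothing and ROUGE-L F1, via the `nltk` and `rouge-score` packages) are assumptions for illustration only.

```python
# Sketch only: BLEU / ROUGE for a single generated response.
# Assumptions: sentence-level BLEU with smoothing and ROUGE-L F1;
# the two strings below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "gold response written by annotators"
candidate = "response generated by the model"

# BLEU on whitespace tokens, smoothed so short outputs do not collapse to zero
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1 between the reference and the candidate
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu * 100:.2f}  ROUGE-L: {rouge_l * 100:.2f}")
```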
We further provide a comprehensive analysis according to the five cognitive abilities.
1. The overall project is based on LAVIS by Salesforce. To reproduce SoMeLVLM, prepare the LAVIS environment first:

   ```bash
   conda create -n SoMeLVLM python=3.8
   conda activate SoMeLVLM
   git clone https://github.com/salesforce/LAVIS.git
   cd LAVIS
   pip install -e .
   ```

   Note that we will modify the model config for inference, so we recommend installing LAVIS from source via git (editable install) as above.

   Steps 2 & 3 below follow the "Adding Models" section of the LAVIS documentation.
2. Add `SoMeLVLM.yaml` to the LAVIS model configs in the `./LAVIS/lavis/configs/models/blip2/` directory (a sketch is given below).
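   The following is only a minimal sketch of what `SoMeLVLM.yaml` could contain, not the released config: the simplest assumption is to copy the existing `blip2_instruct_vicuna7b.yaml` from the same directory and redirect its language-model path to the SoMeLVLM weights prepared in step 4.

   ```yaml
   # Sketch only -- assumed to mirror blip2_instruct_vicuna7b.yaml.
   # Keep all other fields from the copied vicuna7b config unchanged;
   # only the language-model path points at the weights from step 4.
   model:
     arch: blip2_vicuna_instruct
     llm_model: "./llm/SoMeLVLM/"
   ```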
3. Register SoMeLVLM in the original `blip2_vicuna_instruct` model at `./LAVIS/lavis/models/blip2_models/blip2_vicuna_instruct.py`, line 29:

   ```python
   PRETRAINED_MODEL_CONFIG_DICT = {
       "vicuna7b": "configs/models/blip2/blip2_instruct_vicuna7b.yaml",
       "vicuna13b": "configs/models/blip2/blip2_instruct_vicuna13b.yaml",
       "SoMeLVLM": "configs/models/blip2/SoMeLVLM.yaml",
   }
   ```
4. Prepare the model weights from Hugging Face (`Lishi0905/SoMeLVLM`); one way to fetch them is sketched below:
   - `checkpoint.pth` is the connection-module checkpoint;
   - the remaining files are the base language model and should be placed under the `./llm/SoMeLVLM/` directory.
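   A minimal download sketch using the `huggingface_hub` Python package (the file layout is taken from the description above):

   ```python
   # Sketch only: download the SoMeLVLM weights from the Hugging Face Hub.
   from huggingface_hub import snapshot_download

   local_dir = snapshot_download(repo_id="Lishi0905/SoMeLVLM")
   print("Weights downloaded to:", local_dir)
   # - checkpoint.pth    -> connection module, loaded later via model.load_checkpoint()
   # - other model files -> base language model, place them under ./llm/SoMeLVLM/
   ```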
5. Load the SoMeLVLM model:

   ```python
   import torch
   from PIL import Image
   from lavis.models import load_model_and_preprocess

   device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
   model, vis_processors, _ = load_model_and_preprocess(
       name="blip2_vicuna_instruct", model_type="SoMeLVLM", is_eval=True, device=device
   )

   # load the connection-module checkpoint (checkpoint.pth from step 4)
   checkpoint_path = "your/path/to/checkpoint.pth"
   model.load_checkpoint(checkpoint_path)
   ```
6. Start inference:

   ```python
   raw_image = Image.open("your/img/path").convert("RGB")
   image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
   prompt = "your prompt here."
   # generate() returns a list of decoded responses; take the first one
   answer = model.generate({"image": image, "prompt": prompt})[0]
   ```
The data used in this paper come from real users on diverse social media platforms, so privacy is treated with caution. The data drawn from open-source datasets are safe, as sensitive information has already been masked. For the data we collect ourselves, we strictly follow the privacy policies of the social media platforms and carefully remove personal information before releasing our instruction dataset.
If you find SoMeLVLM or our datasets useful, please consider citing our paper:
@inproceedings{zhang-etal-2024-somelvlm,
title = "{S}o{M}e{LVLM}: A Large Vision Language Model for Social Media Processing",
author = "Zhang, Xinnong and
Kuang, Haoyu and
Mou, Xinyi and
Lyu, Hanjia and
Wu, Kun and
Chen, Siming and
Luo, Jiebo and
Huang, Xuanjing and
Wei, Zhongyu",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.140",
doi = "10.18653/v1/2024.findings-acl.140",
pages = "2366--2389",
}