SoMeLVLM: A Large Vision Language Model for Social Media Processing

This is the repository for the ACL 2024 Findings paper SoMeLVLM: A Large Vision Language Model for Social Media Processing, by Xinnong Zhang*, Haoyu Kuang*, Xinyi Mou, Hanjia Lyu, Kun Wu, Siming Chen, Jiebo Luo, Xuanjing Huang, and Zhongyu Wei.

More resources can be found on the SoMeLVLM HomePage.

🎉🎉🎉[News 2024/05/16] SoMeLVLM has been accepted to ACL 2024 Findings!

(Figure: overview of the SoMeLVLM framework)

Datasets & Model Weights

Model weights: Lishi0905/SoMeLVLM · Hugging Face

Plain text & Multimodal datasets: Request Form


Abstract

The growth of social media, characterized by its multimodal nature, has led to the emergence of diverse phenomena and challenges, which calls for an effective approach to uniformly solve automated tasks. The powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, the general domain models often fall short in aligning with the unique speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks. Further analysis shows its significant advantages over baselines in terms of cognitive abilities.

Datasets

We have developed SoMeData, a 654k social media dataset that consists of five cognitive modules and a variety of computational social science (CSS) task categories.

| Module | Category | Data Size (SFT & Eval) | Data Type |
| --- | --- | --- | --- |
| Knowledge & Comprehension | Emotion | 45.0k & 5.0k | Text |
| Knowledge & Comprehension | Emotion | 20.3k & 1.5k | Multi |
| Knowledge & Comprehension | Humor | 18.0k & 8.3k | Text |
| Knowledge & Comprehension | Figurative Language | 12.5k & 4.6k | Text |
| Knowledge & Comprehension | Misinformation | 24.4k & 2.0k | Text |
| Knowledge & Comprehension | Misinformation | 6.5k & 0.5k | Multi |
| Knowledge & Comprehension | Hate Speech & Toxicity | 44.1k & 6.3k | Text |
| Knowledge & Comprehension | Hate Speech & Toxicity | 13.8k & 1.4k | Multi |
| Knowledge & Comprehension | Ideology & Stance | 24.0k & 3.5k | Text |
| Knowledge & Comprehension | Ideology & Stance | 1.6k & 0.3k | Multi |
| Knowledge & Comprehension | Trustworthiness & Social Bias | 11.0k & 3.2k | Text |
| Knowledge & Comprehension | Social Factors | 16.2k & 2.5k | Text |
| Knowledge & Comprehension | Social Factors | 40.0k & 1.0k | Multi |
| Applying | Emotion | 20.0k & 5.0k | Text |
| Applying | Humor | 15.0k & 6.1k | Text |
| Applying | Hate Speech & Toxicity | 29.6k & 16.2k | Text |
| Applying | Ideology & Stance | 4.3k & 1.0k | Text |
| Applying | Trustworthiness & Social Bias | 30.0k & 0.9k | Text |
| Applying | Social Factors | 50.0k & 1.0k | Multi |
| Analysis | Figurative Language | 30.0k & 2.2k | Text |
| Analysis | Emotion | 20.3k & 1.5k | Multi |
| Analysis | Hate Speech & Toxicity | 13.8k & 1.5k | Multi |
| Analysis | Social Factors | 15.0k & 0.5k | Multi |
| Evaluation | Ideology & Stance | 1.6k & 0.3k | Multi |
| Evaluation | Misinformation | 2.0k & 0.0k | Text |
| Evaluation | Misinformation | 6.5k & 0.5k | Multi |
| Evaluation | Detoxifying Content | 25.0k & 9.9k | Text |
| Evaluation | Depolarizing Language | 4.3k & 1.0k | Text |
| Creation | Invert Opinion | 1.0k & 0.0k | Text |
| Creation | Reverse Ideology | 4.3k & 1.0k | Text |
| Creation | Social Factors | 25.0k & 0.5k | Multi |

Experiment Results

We conduct both classification and generation tasks in the plain-text and multimodal domains.
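
The evaluation scripts are not part of this repository; the snippet below is only a minimal sketch of how the BLEU and ROUGE generation metrics might be computed, assuming the sacrebleu and rouge_score packages (the paper's exact tokenization, ROUGE variant, and GPT-based scoring may differ):

    # Hedged sketch: corpus BLEU and average ROUGE-L F1 between model outputs
    # and references. Not the paper's official evaluation code.
    import sacrebleu
    from rouge_score import rouge_scorer

    def generation_scores(predictions, references):
        # corpus-level BLEU; sacrebleu expects a list of reference streams
        bleu = sacrebleu.corpus_bleu(predictions, [references]).score

        # average ROUGE-L F1 over all prediction/reference pairs
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        rouge_l = sum(
            scorer.score(ref, pred)["rougeL"].fmeasure
            for pred, ref in zip(predictions, references)
        ) / len(predictions)

        return {"BLEU": bleu, "ROUGE-L": rouge_l * 100}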

Multimodal results

Classification tasks

| Models | Hate Speech (Acc* / Acc) | Misinformation (Acc* / Acc) | Social Factors (Acc* / Acc) | Emotion (Acc* / Acc) | Ideology (Acc* / Acc) | Social Factors OOD (Acc* / Acc) |
| --- | --- | --- | --- | --- | --- | --- |
| Instructblip_V | 41.62 / 33.43 | 47.55 / 13.60 | 80.02 / 40.93 | 54.53 / 48.90 | 54.15 / 42.41 | 87.30 / 22.59 |
| Instructblip_F | 50.40 / 48.43 | 80.78 / 79.00 | 81.33 / 73.57 | 58.90 / 57.80 | 53.69 / 45.57 | 98.31 / 83.95 |
| Blip2 | 52.14 / 52.14 | 80.60 / 80.60 | 81.83 / 80.89 | 57.73 / 57.73 | 53.48 / 53.48 | 99.15 / 95.69 |
| Llava | 53.35 / 9.79 | 84.67 / 25.40 | 72.49 / 6.69 | 53.39 / 10.10 | 49.79 / 1.58 | 93.75 / 3.08 |
| MiniGPT4 | 45.12 / 23.00 | 65.30 / 54.20 | 64.08 / 36.18 | 53.13 / 29.48 | 42.13 / 8.86 | 69.58 / 34.29 |
| SoMeLVLM | 72.57 / 72.57 | 82.60 / 82.60 | 84.07 / 67.33 | 63.50 / 63.47 | 73.24 / 55.06 | 100.00 / 61.11 |

Generation tasks

| Models | Metric | Hate Speech | Misinformation | Social Factors | Emotion | Ideology | Social Factors OOD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Instructblip_V | BLEU | 0.65 | 1.09 | 6.21 | 0.85 | 0.60 | 1.14 |
| Instructblip_V | ROUGE | 3.13 | 0.88 | 9.02 | 7.26 | 4.89 | 14.03 |
| Instructblip_V | GPT | 1.83 | 2.84 | 1.46 | 1.96 | 1.61 | 2.07 |
| Instructblip_F | BLEU | 0.24 | 0.05 | 1.16 | 0.28 | 0.78 | 1.51 |
| Instructblip_F | ROUGE | 2.79 | 0.81 | 14.60 | 13.69 | 8.36 | 16.91 |
| Instructblip_F | GPT | 2.11 | 2.85 | 2.12 | 3.02 | 1.62 | 2.16 |
| Blip2 | BLEU | 0.62 | 0.02 | 0.76 | 0.16 | 0.25 | 0.65 |
| Blip2 | ROUGE | 2.25 | 1.89 | 11.99 | 14.82 | 4.35 | 12.87 |
| Blip2 | GPT | 1.86 | 2.72 | 1.89 | 3.08 | 2.34 | 1.61 |
| Llava | BLEU | 0.36 | 0.00 | 1.89 | 0.64 | 1.10 | 2.29 |
| Llava | ROUGE | 4.52 | 0.01 | 12.80 | 5.74 | 8.73 | 20.10 |
| Llava | GPT | 1.23 | 0.81 | 1.80 | 1.25 | 1.21 | 2.27 |
| Minigpt4 | BLEU | 0.43 | 0.69 | 1.20 | 0.55 | 0.32 | 1.98 |
| Minigpt4 | ROUGE | 8.84 | 12.15 | 17.20 | 10.81 | 12.68 | 20.73 |
| Minigpt4 | GPT | 2.28 | 2.18 | 1.59 | 2.37 | 1.28 | 1.84 |
| SoMeLVLM | BLEU | 31.04 | 24.06 | 14.49 | 37.65 | 24.08 | 10.18 |
| SoMeLVLM | ROUGE | 46.35 | 43.22 | 32.87 | 53.87 | 41.04 | 31.03 |
| SoMeLVLM | GPT | 3.21 | 2.94 | 2.86 | 3.53 | 3.39 | 3.45 |

Plain text results

Classification tasks

| Models | Emotion | Humor | Figurative Language | Misinformation | Hate Speech | Ideology | Trustworthiness | Social Factors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna-7b-v1.1 | 35.86 | 41.08 | 47.07 | 59.23 | 11.94 | 34.15 | 36.60 | 42.68 |
| Llama2-7b-chat | 40.54 | 61.31 | 53.77 | 41.11 | 12.84 | 37.77 | 59.21 | 31.61 |
| ChatGLM2 | 41.20 | 36.94 | 52.05 | 47.21 | 14.67 | 30.07 | 68.44 | 48.23 |
| SoMeLVLM | 80.66 | 60.47 | 61.70 | 70.38 | 22.20 | 45.23 | 43.52 | 55.39 |

Generation tasks

| Models | Metric | Emotion | Humor | Figurative Language | Offensiveness | Ideology | Trustworthiness | Detoxifying Content | Depolarizing Language | Reverse Ideology |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna-7b-v1.1 | BLEU | 7.97 | 10.49 | 8.03 | 7.01 | 9.36 | 9.70 | 10.43 | 22.31 | 33.40 |
| Vicuna-7b-v1.1 | ROUGE | 31.31 | 36.21 | 31.55 | 31.24 | 32.78 | 34.13 | 27.96 | 42.72 | 51.76 |
| Vicuna-7b-v1.1 | GPT | 3.23 | 3.24 | 2.57 | 3.63 | 3.41 | 3.13 | 2.50 | 3.26 | 2.98 |
| Llama2-7b-chat | BLEU | 4.25 | 6.36 | 10.39 | 1.79 | 4.75 | 4.73 | 1.31 | 8.40 | 20.54 |
| Llama2-7b-chat | ROUGE | 23.50 | 28.37 | 31.32 | 17.41 | 25.01 | 26.54 | 10.94 | 26.72 | 38.06 |
| Llama2-7b-chat | GPT | 2.99 | 2.48 | 2.73 | 1.94 | 2.78 | 2.82 | 1.14 | 2.21 | 2.04 |
| ChatGLM2 | BLEU | 6.60 | 8.98 | 7.20 | 4.50 | 6.59 | 9.25 | 6.84 | 13.33 | 21.91 |
| ChatGLM2 | ROUGE | 29.47 | 34.49 | 29.07 | 28.05 | 29.94 | 34.35 | 23.92 | 35.66 | 42.27 |
| ChatGLM2 | GPT | 3.05 | 2.37 | 2.06 | 2.93 | 2.86 | 2.73 | 2.00 | 2.80 | 2.80 |
| SoMeLVLM | BLEU | 26.96 | 13.81 | 23.77 | 17.24 | 14.60 | 12.37 | 27.13 | 23.54 | 44.09 |
| SoMeLVLM | ROUGE | 51.88 | 42.84 | 45.42 | 43.10 | 39.49 | 39.06 | 47.76 | 45.47 | 61.96 |
| SoMeLVLM | GPT | 3.63 | 3.38 | 3.02 | 3.64 | 3.43 | 3.59 | 2.89 | 3.28 | 3.41 |

Cognitive abilities results

A comprehensive analysis across the five cognitive abilities.

(Figure: radar chart comparing models across the cognitive abilities)

Demo Examples

Knowledge & Comprehension

(Example figure: knowledge & comprehension)

Analysis

(Example figure: analysis)

Creation

(Example figure: creation)

Reproduction

Inference

  1. The overall project is based on LAVIS by Salesforce. To reproduce SoMeLVLM, prepare the LAVIS environment first:

    conda create -n SoMeLVLM python=3.8
    conda activate SoMeLVLM
    git clone https://github.com/salesforce/LAVIS.git
    cd LAVIS
    pip install -e .
    

    Note that the model config will be modified for inference, so we recommend installing LAVIS from source via git.

    The following steps 2 & 3 follow the "Adding Models" section of the LAVIS documentation.

  2. Add SoMeLVLM.yaml to the LAVIS model configs in the ./LAVIS/lavis/configs/models/blip2/ directory (one way to derive it is sketched below).
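
    The snippet below is only a sketch of one way to create this file, assuming SoMeLVLM.yaml mirrors LAVIS's blip2_instruct_vicuna7b.yaml with the language-model path redirected to the local SoMeLVLM weights; the llm_model field name is taken from that base config, and the released checkpoint's own config may differ:

    # Hedged sketch: derive SoMeLVLM.yaml from the InstructBLIP Vicuna-7B config
    # shipped with LAVIS by pointing the language model at the local SoMeLVLM
    # weights. Verify the field names against the released checkpoint before use.
    import yaml

    base_path = "lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml"
    new_path = "lavis/configs/models/blip2/SoMeLVLM.yaml"

    with open(base_path) as f:
        config = yaml.safe_load(f)

    config["model"]["llm_model"] = "./llm/SoMeLVLM/"  # assumed weight directory

    with open(new_path, "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)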

  3. Register SoMeLVLM in the original blip2_vicuna_instruct model at ./LAVIS/lavis/models/blip2_models/blip2_vicuna_instruct.py, Line 29:

    PRETRAINED_MODEL_CONFIG_DICT = {
            "vicuna7b": "configs/models/blip2/blip2_instruct_vicuna7b.yaml",
            "vicuna13b": "configs/models/blip2/blip2_instruct_vicuna13b.yaml",
            "SoMeLVLM": "configs/models/blip2/SoMeLVLM.yaml"
        }
  4. Prepare the model weights from Hugging Face (one way to fetch them is sketched after this list):

    • checkpoint.pth for the connection module;
    • the remaining files for the base language model, which should be placed under the ./llm/SoMeLVLM/ directory.
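
    A minimal sketch, assuming the Hugging Face repo id Lishi0905/SoMeLVLM linked above and that the huggingface_hub package is installed:

    # Hedged sketch: download the released files with huggingface_hub.
    # checkpoint.pth (the connection module) is then passed to
    # model.load_checkpoint() in step 5; the remaining language-model files
    # should end up under ./llm/SoMeLVLM/ as described above.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id="Lishi0905/SoMeLVLM", local_dir="./llm/SoMeLVLM")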
  5. Load the SoMeLVLM model:

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip2_vicuna_instruct", model_type="SoMeLVLM", is_eval=True, device=device
    )
    # load the connection module checkpoint (checkpoint.pth from Hugging Face)
    checkpoint_path = "path/to/checkpoint.pth"
    model.load_checkpoint(checkpoint_path)
  6. Start inference:

    # preprocess the image, then generate a response for the given prompt
    raw_image = Image.open("your/img/path").convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    prompt = "your prompt here."
    answer = model.generate({"image": image, "prompt": prompt})[0]
    print(answer)

Ethics Statement

The data used in this paper come from real users on diverse social media platforms, so privacy is treated cautiously. Data from open-source datasets are safe, as sensitive information has already been masked. For the data we collected ourselves, we strictly follow the privacy policies of the social media platforms and will carefully remove personal information before releasing our instruction dataset.

Citation

If you find SoMeLVLM or our datasets useful, please consider citing our paper:

@inproceedings{zhang-etal-2024-somelvlm,
    title = "{S}o{M}e{LVLM}: A Large Vision Language Model for Social Media Processing",
    author = "Zhang, Xinnong  and
      Kuang, Haoyu  and
      Mou, Xinyi  and
      Lyu, Hanjia  and
      Wu, Kun  and
      Chen, Siming  and
      Luo, Jiebo  and
      Huang, Xuanjing  and
      Wei, Zhongyu",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.140",
    doi = "10.18653/v1/2024.findings-acl.140",
    pages = "2366--2389",
}
