🤗 Model | 💬 Demo: Chat with Dimple | 📑 Paper | ✨ Code
Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM), trained with a hybrid paradigm that combines autoregressive and diffusion-based instruction tuning. Its architecture is similar to Qwen and LLaVA, but it introduces an autoregressive-then-diffusion training strategy (sketched after the list below):
- Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
- Stage 2: Diffusion-based fine-tuning for enhanced instruction-following capabilities.
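To make the two stages concrete, here is a minimal sketch of the two objectives. This is not Dimple's actual training code: model, the mask-token id, and the mask-ratio schedule are placeholders we assume for illustration.

import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id; the real one comes from the tokenizer

def stage1_autoregressive_loss(model, input_ids):
    # Stage 1: standard next-token prediction (causal LM cross-entropy).
    logits = model(input_ids[:, :-1]).logits          # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def stage2_diffusion_loss(model, input_ids, mask_ratio):
    # Stage 2: mask a random fraction of tokens and train the model to
    # recover them in parallel, i.e. a masked-denoising diffusion objective.
    noisy = input_ids.clone()
    masked = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    noisy[masked] = MASK_ID
    logits = model(noisy).logits                      # (B, T, V)
    return F.cross_entropy(logits[masked], input_ids[masked])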
Trained on the same dataset as LLaVA-NEXT, Dimple-7B surpasses LLaVA-NEXT-7B by 3.9%, demonstrating that diffusion-based multimodal large language models can match their autoregressive counterparts under a similar training budget.
- Hybrid Training: Combines autoregressive and diffusion training.
- Diffusion Decoding: Supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding (see the sketch after this list).
- Controllable Generation: Enables fine-grained control over format, structure, and length via structure priors.
- Autoregressive-like Prefilling: Enhances inference speed using prefilling techniques.
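For intuition, here is a minimal sketch of one confident-decoding step. It is our illustration rather than the released implementation: masked positions are filled in parallel whenever the model's top prediction clears a confidence threshold.

import torch

def confident_decode_step(logits, tokens, mask_id, threshold=0.95):
    # logits: (T, V) model outputs; tokens: (T,) current sequence with masks.
    probs = logits.softmax(-1)
    conf, pred = probs.max(-1)            # per-position confidence and argmax
    masked = tokens == mask_id
    accept = masked & (conf >= threshold)
    if masked.any() and not accept.any():
        # Commit at least the single most confident masked token so that
        # decoding always makes progress.
        idx = torch.where(masked)[0]
        accept[idx[conf[idx].argmax()]] = True
    tokens = tokens.clone()
    tokens[accept] = pred[accept]         # unmask the accepted positions
    return tokens

Random, MaskGIT-style, and entropy-based decoding differ mainly in how the positions to commit at each step are scored and selected.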
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (img) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
Make sure your environment includes the following versions:
- transformers==4.46.2
- torch==2.5.1
- accelerate==1.6.0
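For example, assuming a pip-based setup:

pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0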
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"

# Dimple ships custom modeling code, so trust_remote_code must be enabled.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Build a chat-format request with one image and one text instruction.
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."},
    ]}],
]
# Render the chat template, download the image, and preprocess both.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]
inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")

# Generate with diffusion decoding; the remaining processor outputs
# (e.g. image tensors) are passed through via **inputs.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)
# Strip the prompt tokens and decode only the newly generated response.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
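In this call, max_new_tokens caps the response length and steps bounds the number of diffusion iterations, while alg, alg_p_threshold, and use_original_confidence select and tune the confidence-based decoding rule; use_cache=True enables the autoregressive-like prefilling described above. Lowering steps below max_new_tokens should trade answer quality for speed, since more tokens must then be committed per iteration. This is our reading of the parameter names, not an official tuning guide.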
We evaluate Dimple using lmms-eval. The evaluation scripts are provided in the lmms-eval folder, and the evaluation commands for Dimple can be found at lmms-eval/examples/models/dimple.sh.
To run the evaluation, follow these steps:
- Install Dimple dependencies. Make sure all required packages for Dimple are installed; refer to Dimple's environment setup instructions above for details.
- Install lmms-eval dependencies. Next, install the necessary dependencies for lmms-eval; these are listed in the lmms-eval repository.
- Set your OPENAI_API_KEY. Some tasks require OpenAI API access. Edit the file lmms-eval/examples/models/dimple.sh and replace OPENAI_API_KEY=MY_OPENAI_API_KEY with your actual API key.
Once all dependencies are installed and your API key is set, you can run the evaluation script directly:
sh lmms-eval/examples/models/dimple.sh
This will execute the evaluation pipeline for Dimple using the default configuration.
Feel free to join the Dimple Community for in-depth discussions and idea exchange!
Citation information will be provided soon. Please stay tuned if you are interested in citing Dimple in your work.