Multimodal Planning Framework

This repository contains the code and materials for the ACL 2025 Findings paper "Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation".

🚀 Features

Multiple AI Backends: Currently supports GPT-4o, Gemini, and Mistral models
Various Processing Modes: From simple text-based planning to complex, visually enhanced planning
Modular Architecture: Easy to extend with new processors and backbones

📁 Project Structure

├── planner.py           # Main entry point
├── config.py            # Configuration management
├── models.py            # Model backends and management
├── processors.py        # Task processing implementations
├── pipeline.py          # Main processing pipeline
├── utils.py             # Utility functions and helpers
└── README.md            # This file

🛠️ Installation

Clone the repository:

git clone <repository-url>
cd llm_planning

Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export GOOGLE_API_KEY="your-google-api-key"

🚀 Usage

Basic Usage

python planner.py --mode vanilla --backbone gpt4o --model gpt-4o --save_dir ./results

Advanced Usage

python planner.py \
  --mode tip \
  --backbone gem \
  --model gemini-1.5-flash \
  --data_dir ./dataset/tasks.csv \
  --save_dir ./results \
  --start_idx 0 \
  --end_idx 50 \
  --temperature 0.2 \
  --seed 42
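
For long task lists it can be convenient to split the index range into smaller runs. The sketch below only builds the command lines, using the flags shown in the example above. The helper name `build_commands` and the chunking scheme are illustrative and not part of this repository, and it assumes `--end_idx` is exclusive, which this README does not state.

```python
import shlex
import sys


def build_commands(total, chunk_size, mode="tip", backbone="gem",
                   model="gemini-1.5-flash",
                   data_dir="./dataset/tasks.csv", save_dir="./results"):
    """Return one planner.py argument list per chunk of task indices.

    Hypothetical helper: assumes --end_idx is exclusive, so consecutive
    chunks use [0, 50), [50, 100), and so on.
    """
    commands = []
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total)
        commands.append([
            sys.executable, "planner.py",
            "--mode", mode,
            "--backbone", backbone,
            "--model", model,
            "--data_dir", data_dir,
            "--save_dir", save_dir,
            "--start_idx", str(start),
            "--end_idx", str(end),
            "--temperature", "0.2",
            "--seed", "42",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True)
    # to actually execute it.
    for cmd in build_commands(total=120, chunk_size=50):
        print(shlex.join(cmd))
```

If the end index turns out to be inclusive, adjust the chunk boundaries so that tasks are not processed twice.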

Available Commands

List available modes:
```
python planner.py --list_modes
```

Validate configuration:

python planner.py --validate_config --backbone gpt4o --model gpt-4o --save_dir ./test

Dry run (show what would be processed):

python planner.py --dry_run --mode vanilla --backbone gpt4o --save_dir ./results

🎯 Processing Modes

| Mode | Description |
|------|-------------|
| `vanilla` | Generate the textual plan first, then create the visual plan |
| `stable` | Generate the visual plan first, then create the textual plan |
| `tip` | TIP-based image generation and revision |
| `w_des` | Textual plan refinement with detailed image descriptions |
| `w_img` | Textual plan refinement with visual interpretation |
| `ours` | Our approach, using pPDDL visual information and coherent image generation |

🤖 Supported Backbones

GPT-4o

Models: gpt-4o, gpt-4o-mini
Requires: OPENAI_API_KEY

Gemini

Models: gemini-1.5-flash
Requires: GOOGLE_API_KEY

Mistral

Models: mistral-7b, mistral-8x7b
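
Before starting a long run, it may help to check that the API key for the chosen backbone is set. The following is a minimal sketch, not code from this repository: the backbone values `gpt4o` and `gem` are taken from the usage examples, the function name `missing_key` is hypothetical, and Mistral is left out because this README lists no key for it.

```python
import os

# Backbone flag value -> environment variable listed in this README.
# Mistral is omitted because no required key is documented for it.
REQUIRED_KEYS = {
    "gpt4o": "OPENAI_API_KEY",
    "gem": "GOOGLE_API_KEY",
}


def missing_key(backbone, env=None):
    """Return the name of the unset API key for a backbone, or None.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = REQUIRED_KEYS.get(backbone)
    if key is not None and not env.get(key):
        return key
    return None


if __name__ == "__main__":
    for backbone in REQUIRED_KEYS:
        key = missing_key(backbone)
        status = f"missing {key}" if key else "ok"
        print(f"{backbone}: {status}")
```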

📊 Output Structure

Each processed task creates a directory with the following structure:

results/
└── task_0/
    ├── ori_plan.txt      # Original generated plan
    ├── rev_plan.txt      # Revised plan (if applicable)
    ├── descriptions.txt  # Image descriptions
    ├── captions.txt      # Image captions
    ├── step_1.png        # Generated images
    ├── step_2.png
    └── ...
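
The layout above can be read back for inspection or evaluation with a few lines of Python. This is a sketch under the assumption that file names match the tree exactly; `load_task` is an illustrative helper and not part of this repository. Files that a mode does not produce (for example `rev_plan.txt`) are returned as `None`.

```python
import re
from pathlib import Path


def load_task(task_dir):
    """Load the text files and step images of one task directory."""
    task_dir = Path(task_dir)

    def read_optional(name):
        # Some files exist only for certain modes, so missing ones map to None.
        path = task_dir / name
        return path.read_text(encoding="utf-8") if path.is_file() else None

    def step_number(path):
        # Extract N from step_N.png; return -1 for names that do not match.
        match = re.fullmatch(r"step_(\d+)", path.stem)
        return int(match.group(1)) if match else -1

    # Sort numerically so that step_10.png comes after step_2.png.
    images = sorted(
        (p for p in task_dir.glob("step_*.png") if step_number(p) >= 0),
        key=step_number,
    )
    return {
        "ori_plan": read_optional("ori_plan.txt"),
        "rev_plan": read_optional("rev_plan.txt"),
        "descriptions": read_optional("descriptions.txt"),
        "captions": read_optional("captions.txt"),
        "images": images,
    }
```

For example, `load_task("results/task_0")["images"]` would give the step images in order, assuming that directory exists.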

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
