Zeqi Xiao1
Yushi Lan1
Yifan Zhou1
Wenqi Ouyang1
Shuai Yang2
Yanhong Zeng3
Xingang Pan1
1S-Lab, Nanyang Technological University,
2Wangxuan Institute of Computer Technology, Peking University,
3Shanghai AI Laboratory
```bash
conda create python=3.10 -n worldmem
conda activate worldmem
pip install -r requirements.txt
conda install -c conda-forge ffmpeg=4.3.2
```

Then launch the interactive demo:

```bash
python app.py
```
To enable cloud logging with Weights & Biases (wandb), follow these steps:

- Sign up for a [wandb](https://wandb.ai) account.
- Run the following command to log in:

  ```bash
  wandb login
  ```

- Open `configurations/training.yaml` and set the `entity` field to your wandb username and the `project` field to your project name.
Download pretrained weights from Oasis.
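If the Oasis weights are hosted on Hugging Face, one way to fetch them is with `huggingface-cli`; the repository id below is a placeholder, so substitute the one linked from the Open-oasis project:

```bash
# Hypothetical example: replace <oasis-repo-id> with the repository
# referenced by the Open-oasis project.
huggingface-cli download <oasis-repo-id> --local-dir checkpoints/oasis
```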
We train the model on 4 H100 GPUs; it converges after approximately 500K steps. We observe that gradually increasing task difficulty improves performance, so we adopt a multi-stage training strategy:
```bash
sh train_stage_1.sh # Small range, no vertical turning
sh train_stage_2.sh # Large range, no vertical turning
sh train_stage_3.sh # Large range, with vertical turning
```
To resume training from a previous checkpoint, configure the `resume` and `output_dir` variables in the corresponding `.sh` script, as in the sketch below.
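A minimal sketch of how this might look inside one of the stage scripts (the variable names `resume` and `output_dir` come from this README; the paths are hypothetical):

```bash
# Hypothetical values; point these at your own run.
resume=outputs/stage_1/checkpoints/last.ckpt   # checkpoint to resume from
output_dir=outputs/stage_2                     # where new checkpoints and logs go
```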
To run inference:
```bash
sh infer.sh
```
You can either load the diffusion model and VAE separately:
```bash
+diffusion_model_path=yslan/worldmem_checkpoints/diffusion_only.ckpt \
+vae_path=yslan/worldmem_checkpoints/vae_only.ckpt \
+customized_load=true \
+seperate_load=true \
```
Or load a combined checkpoint:
```bash
+load=your_model_path \
+customized_load=true \
+seperate_load=false \
```
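Assuming `infer.sh` forwards extra arguments to the underlying Python entry point (an assumption worth verifying in the script itself), a complete invocation might look like:

```bash
# Hypothetical invocation; the override names are from this README,
# but the argument-forwarding behavior of infer.sh is an assumption.
sh infer.sh \
  +load=your_model_path \
  +customized_load=true \
  +seperate_load=false
```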
Download the Minecraft dataset from Hugging Face, then place it in the following directory structure:

```
data/
└── minecraft/
    ├── training/
    └── validation/
```
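One way to fetch the dataset is with `huggingface-cli`; the repository id below is a placeholder, so substitute the dataset actually linked from this README:

```bash
# Hypothetical example: replace <dataset-repo-id> with the real repository id.
huggingface-cli download <dataset-repo-id> \
  --repo-type dataset \
  --local-dir data/minecraft
```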
- Release inference models and weights;
- Release training pipeline on Minecraft;
- Release training data on Minecraft;
If you find our work helpful, please cite:
```bibtex
@misc{xiao2025worldmemlongtermconsistentworld,
      title={WORLDMEM: Long-term Consistent World Simulation with Memory},
      author={Zeqi Xiao and Yushi Lan and Yifan Zhou and Wenqi Ouyang and Shuai Yang and Yanhong Zeng and Xingang Pan},
      year={2025},
      eprint={2504.12369},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.12369},
}
```
- Diffusion Forcing: Diffusion Forcing provides the flexible training and inference strategies used in our method.
- MineDojo: We collect our Minecraft dataset using MineDojo.
- Open-oasis: Our model architecture is based on Open-oasis, and we use its pretrained VAE and DiT weights.