We introduce ScreenExplorer, a vision-language model (VLM) trained with Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments for diverse exploration. Given only screenshots and a fixed instruction that encourages exploration, ScreenExplorer learns to interact effectively with the screen environment.
Demo videos: ScreenExplorer-3B-E1.mov and ScreenExplorer-7B-E1.mov
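Concretely, each rollout repeats a simple loop: capture the current screenshot, let the policy VLM emit a structured action under the fixed exploration instruction, and execute that action in the environment over VNC. Below is a minimal sketch of this loop; the class and method names are illustrative assumptions, not the repo's exact API.

# Conceptual sketch of one exploration rollout: screenshot in, structured action out.
# `policy.select_action`, `env.reset`/`env.step`, and the Action fields are
# illustrative names, not the exact interfaces defined under src/.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    action_type: str                       # e.g. "click", "type", "scroll"
    coordinate: Optional[Tuple[int, int]]  # pixel position for pointer actions
    text: Optional[str]                    # payload for typing actions

def explore_episode(env, policy, exploration_prompt, max_steps=20):
    screenshot = env.reset()                           # initial screen image
    trajectory = []
    for _ in range(max_steps):
        # The VLM only sees the current screenshot plus the fixed instruction.
        action = policy.select_action(screenshot, exploration_prompt)
        next_screenshot = env.step(action)             # executed in the screen environment
        trajectory.append((screenshot, action, next_screenshot))
        screenshot = next_screenshot
    return trajectory                                  # later scored by the exploration rewards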
ScreenExplorer/
├── requirements.txt
└── src/
    ├── schema/
    │   ├── action_selection_by_vlm_en.txt: The fixed instruction to encourage exploration
    │   ├── action_selection.py: Action selection schema
    │   └── __init__.py
    ├── screen_env/
    │   ├── asyncvnc.py: VNC client for screen interaction
    │   └── screen_env.py: Environment wrapper for screen-based interaction
    ├── train_explorer.py: Main training script for the explorer agent
    ├── exploration_reward.py: Exploration rewards
    ├── online_eval.py: Online evaluation script
    ├── rollout_buffer.py: Manages experience rollouts for training
    ├── utils.py
    └── world_model.py: World model implementation
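exploration_reward.py and world_model.py supply the training signal: roughly, screenshots are encoded into discrete tokens by the Cosmos tokenizer and the small Llama-3.2-1B world model tries to predict the next frame, so transitions it predicts poorly earn a larger curiosity bonus. The sketch below only illustrates that idea; the interfaces and the exact reward formula are assumptions, not the code in this repo.

# Illustrative world-model curiosity bonus: reward = next-frame prediction error.
# `tokenizer.encode` and the `world_model(seq)` call returning raw logits are
# assumed interfaces for the sake of the sketch.
import torch
import torch.nn.functional as F

def curiosity_bonus(world_model, tokenizer, screenshot, next_screenshot, action_tokens):
    cur_tokens = tokenizer.encode(screenshot)        # discrete visual tokens, shape (T,)
    next_tokens = tokenizer.encode(next_screenshot)  # shape (T,)
    seq = torch.cat([cur_tokens, action_tokens, next_tokens]).unsqueeze(0)

    with torch.no_grad():
        logits = world_model(seq)                    # (1, L, vocab): next-token predictions

    # Score only the next-frame positions: logits at position i predict token i+1.
    n = next_tokens.numel()
    pred = logits[0, -n - 1:-1]                      # predictions for the next-frame tokens
    return F.cross_entropy(pred, next_tokens).item()  # higher error -> more novel transition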
- Download the Cosmos-Tokenizer-CI16x16 pretrained checkpoint from here and put it in the src/pretrained_ckpts/ directory.
- Make sure you have downloaded the base model Qwen/Qwen2.5-VL-3B-Instruct or Qwen/Qwen2.5-VL-7B-Instruct from Hugging Face, as well as meta-llama/Llama-3.2-1B for the world model (see the download sketch after the Docker commands below).
- Set up the Docker environment for the screen environment:
docker pull sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 # global
# or
docker pull ccr.ccs.tencentyun.com/screenagent/screenagent:2.0 # in China
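For the model downloads above, huggingface_hub can fetch everything in one short script. This is only a sketch: the Cosmos tokenizer repo ID nvidia/Cosmos-Tokenizer-CI16x16 is an assumption, so prefer the link referenced in the checkpoint step if it differs.

# Sketch: fetch the required models with huggingface_hub.
# The Cosmos tokenizer repo ID is assumed; Llama-3.2-1B is gated and needs an
# accepted license plus `huggingface-cli login`.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")   # or Qwen/Qwen2.5-VL-7B-Instruct
snapshot_download("meta-llama/Llama-3.2-1B")       # world model backbone
snapshot_download(
    "nvidia/Cosmos-Tokenizer-CI16x16",
    local_dir="src/pretrained_ckpts/Cosmos-Tokenizer-CI16x16",
)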
Train the 3B model on 1 GPU:
cd src
export CUDA_VISIBLE_DEVICES=0
python train_explorer.py \
--model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
--world_model_name_or_path meta-llama/Llama-3.2-1B \
--cosmos_tokenizer_pretrained_ckpts ./pretrained_ckpts \
--cosmos_tokenizer_model_name Cosmos-Tokenizer-CI16x16 \
--image_name sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 \
--save_checkpoint_interval 10
Train the 7B model on 2 GPUs:
cd src
export CUDA_VISIBLE_DEVICES=0,1
python train_explorer.py \
--model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
--world_model_name_or_path meta-llama/Llama-3.2-1B \
--cosmos_tokenizer_pretrained_ckpts ./pretrained_ckpts \
--cosmos_tokenizer_model_name Cosmos-Tokenizer-CI16x16 \
--image_name sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 \
--actor_model_device "cuda:1" \
--save_checkpoint_interval 10
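Training follows GRPO: for each state, a group of rollouts is sampled and each rollout's reward is converted into a group-relative advantage by normalizing against the group's mean and standard deviation. A minimal sketch of that normalization is shown below; it is illustrative only, not the exact code in train_explorer.py.

# Group-relative advantage as used in GRPO-style training (illustrative sketch).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) rewards for rollouts sampled from the same state."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)  # above-average rollouts get positive advantage

# Example: 4 rollouts from one screen state
advantages = group_relative_advantages(torch.tensor([0.2, 0.8, 0.5, 0.1]))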
You can download the trained LoRA checkpoints from Hugging Face or train your own model as described above.
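If the released checkpoints are standard PEFT LoRA adapters (an assumption; online_eval.py is the supported evaluation path), they can also be attached to the base model directly for quick inspection:

# Minimal sketch for attaching a LoRA adapter to the base model with PEFT.
# The checkpoint path and adapter format are assumptions; requires a recent
# transformers release that ships Qwen2_5_VLForConditionalGeneration.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# <path_to_lora_checkpoint> is a placeholder for a downloaded or trained adapter dir.
model = PeftModel.from_pretrained(base, "<path_to_lora_checkpoint>")
model = model.merge_and_unload()  # optional: merge adapter weights for faster inference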
Evaluate the base 3B model on 1 GPU:
cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py --eval_episodes 20 --model_type vllm --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct --temperature 1.0
Evaluate a checkpoint of the 3B model on 1 GPU:
cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py \
--eval_episodes 20 \
--model_type vllm \
--model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
--load_lora_weights logs/<path_to_your_experiment_checkpoint_dir>/episode_100/actor_model_100 \
--temperature 1.0
Evaluate a checkpoint of the 7B model on 1 GPU:
cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py \
--eval_episodes 20 \
--model_type vllm \
--model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
--load_lora_weights logs/<path_to_your_experiment_checkpoint_dir>/episode_200/actor_model_200 \
--temperature 1.0
@misc{niu2025screenexplorertrainingvisionlanguagemodel,
title={ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World},
author={Runliang Niu and Jinglong Ji and Yi Chang and Qi Wang},
year={2025},
eprint={2505.19095},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.19095},
}