ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World


We introduce ScreenExplorer, a vision-language model (VLM) trained with Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. Conditioned only on screenshots and a fixed instruction that encourages exploration, the model learns to interact effectively with the screen environment and to reach diverse states.
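For context, GRPO drops PPO's learned critic and instead scores each rollout against the other rollouts sampled for the same state: the advantage is the reward standardized within its group. A minimal sketch of that computation (illustrative only; the names are ours, not taken from train_explorer.py):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_groups, group_size), one scalar reward per sampled rollout.
    # Standardize each reward against its own group, so no value network is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: two groups of four rollouts each.
rewards = torch.tensor([[1.0, 0.5, 0.0, 2.0],
                        [0.2, 0.2, 0.4, 0.0]])
advantages = group_relative_advantages(rewards)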

Demo Videos

- ScreenExplorer-3B-E1: ScreenExplorer-3B-E1.mov
- ScreenExplorer-7B-E1: ScreenExplorer-7B-E1.mov

Project Structure

ScreenExplorer/
├── requirements.txt
└── src/
    ├── schema
    │   ├── action_selection_by_vlm_en.txt: The fixed instruction to encourage exploration
    │   ├── action_selection.py:            Action selection schema
    │   └── __init__.py
    ├── screen_env
    │   ├── asyncvnc.py:                    VNC client for screen interaction
    │   └── screen_env.py:                  Environment wrapper for screen-based interaction
    ├── train_explorer.py:                  Main training script for the explorer agent
    ├── exploration_reward.py:              Exploration rewards
    ├── online_eval.py:                     Online evaluation script
    ├── rollout_buffer.py:                  Manages experience rollouts for training
    ├── utils.py
    └── world_model.py:                     World model implementation
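To give a feel for how these pieces fit together: exploration_reward.py and world_model.py supply the training signal, and the Cosmos tokenizer turns screenshots into latents the world model can operate on. Below is a sketch of one plausible curiosity-style reward, where the world model's error in predicting the next frame's latent is used as the exploration reward; this is our own illustration under those assumptions, not the repository's exact code.

import torch
import torch.nn.functional as F

def curiosity_reward(world_model, z_t, action_emb, z_next):
    # z_t, z_next: latent codes of consecutive screenshots from the frame tokenizer.
    # action_emb: embedding of the action taken between the two frames.
    # Transitions the world model predicts poorly indicate novel states,
    # so the prediction error itself is returned as the reward.
    with torch.no_grad():
        z_pred = world_model(torch.cat([z_t, action_emb], dim=-1))
    return F.mse_loss(z_pred, z_next, reduction="none").mean(dim=-1)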

Preparation

  1. Download the Cosmos-Tokenizer-CI16x16 pretrained checkpoint and place it in the src/pretrained_ckpts/ directory.

  2. Make sure you have downloaded the base model Qwen/Qwen2.5-VL-3B-Instruct or Qwen/Qwen2.5-VL-7B-Instruct from Hugging Face, as well as meta-llama/Llama-3.2-1B for the world model (a download sketch follows this list).

  3. Set up the Docker environment for the screen environment:

docker pull sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 # global mirror
# or
docker pull ccr.ccs.tencentyun.com/screenagent/screenagent:2.0 # mirror within China
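If the models are not already cached locally, huggingface_hub can fetch them; the Cosmos repo id and target directory below are our assumptions, chosen to match step 1:

from huggingface_hub import snapshot_download

# Base VLM (pick the 3B or 7B variant) and the world-model backbone.
snapshot_download(repo_id="Qwen/Qwen2.5-VL-3B-Instruct")
snapshot_download(repo_id="meta-llama/Llama-3.2-1B")  # requires accepting Meta's license

# Assumed Hugging Face location of the Cosmos tokenizer checkpoint (step 1).
snapshot_download(
    repo_id="nvidia/Cosmos-Tokenizer-CI16x16",
    local_dir="src/pretrained_ckpts/Cosmos-Tokenizer-CI16x16",
)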

Run Training

Train the 3B model on 1 GPU:

cd src
export CUDA_VISIBLE_DEVICES=0
python train_explorer.py \
  --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
  --world_model_name_or_path meta-llama/Llama-3.2-1B \
  --cosmos_tokenizer_pretrained_ckpts ./pretrained_ckpts \
  --cosmos_tokenizer_model_name Cosmos-Tokenizer-CI16x16 \
  --image_name sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 \
  --save_checkpoint_interval 10

Train the 7B model on 2 GPUs:

cd src
export CUDA_VISIBLE_DEVICES=0,1
python train_explorer.py \
  --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
  --world_model_name_or_path meta-llama/Llama-3.2-1B \
  --cosmos_tokenizer_pretrained_ckpts ./pretrained_ckpts \
  --cosmos_tokenizer_model_name Cosmos-Tokenizer-CI16x16 \
  --image_name sgccr.ccs.tencentyun.com/screenagent/screenagent:2.0 \
  --actor_model_device "cuda:1" \
  --save_checkpoint_interval 10
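The --actor_model_device flag is what splits this run across the two visible GPUs: the actor VLM is placed on cuda:1 while the world model and tokenizer stay on the default cuda:0. A rough sketch of that placement (the split is our reading of the flag; class names are from transformers):

import torch
from transformers import AutoModelForCausalLM, Qwen2_5_VLForConditionalGeneration

# Actor policy on the second GPU, as selected by --actor_model_device "cuda:1".
actor = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
).to("cuda:1")

# World model (and, by assumption, the Cosmos tokenizer) remain on cuda:0.
world_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
).to("cuda:0")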

Run Online Evaluation

You can download the trained LoRA checkpoints from Hugging Face or train your own model as described above.

Evaluate the base 3B model on 1 GPU:

cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py \
  --eval_episodes 20 \
  --model_type vllm \
  --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
  --temperature 1.0

Evaluate a checkpoint of the 3B model on 1 GPU:

cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py \
  --eval_episodes 20 \
  --model_type vllm \
  --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
  --load_lora_weights logs/<path_to_your_experiment_checkpoint_dir>/episode_100/actor_model_100 \
  --temperature 1.0

Evaluate a checkpoint of the 7B model on 1 GPU:

cd src
export CUDA_VISIBLE_DEVICES=0
python online_eval.py \
  --eval_episodes 20 \
  --model_type vllm \
  --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
  --load_lora_weights logs/<path_to_your_experiment_checkpoint_dir>/episode_200/actor_model_200 \
  --temperature 1.0
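Under the hood, --load_lora_weights amounts to serving the base model through vLLM with a LoRA adapter attached. A minimal standalone equivalent (the adapter path and prompt are placeholders, and the real script also feeds in screenshots):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", enable_lora=True)
params = SamplingParams(temperature=1.0, max_tokens=256)

# Attach the trained explorer adapter; replace the path with your checkpoint.
lora = LoRARequest("screen_explorer", 1,
                   "logs/<path_to_your_experiment_checkpoint_dir>/episode_100/actor_model_100")
outputs = llm.generate(["<exploration instruction here>"], params, lora_request=lora)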

Citation

@misc{niu2025screenexplorertrainingvisionlanguagemodel,
      title={ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World}, 
      author={Runliang Niu and Jinglong Ji and Yi Chang and Qi Wang},
      year={2025},
      eprint={2505.19095},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.19095}, 
}
