Description
Hi, thank you for the excellent work on GPN — it's a really well-structured and efficient framework for genomic sequence modeling. I used it to pretrain a model on my own dataset with the GPNForMaskedLM architecture.
After training, my config.json looks like this:
```json
{
  "architectures": ["GPNForMaskedLM"],
  "model_type": "GPN",
  "vocab_size": 7,
  "embedding": "one_hot",
  "embedding_size": 768,
  "encoder": "convnet",
  "num_hidden_layers": 25,
  "hidden_size": 512,
  "pad_token_id": 0,
  "max_position_embeddings": 1536,
  ...
}
```
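For context, here is how I am currently loading this checkpoint. This is only my own sketch following the README pattern of importing `gpn.model` so the custom `model_type` is registered with the `transformers` Auto classes; the path is a placeholder for my local run:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import gpn.model  # registers the GPN model type with the transformers Auto classes

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path to my pretraining run

# Assumption: the tokenizer files were saved alongside the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
```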
My questions are:
1. Should I manually add "num_labels": 2 to config.json before fine-tuning, or can it be passed at load time? (My current assumption is in the first sketch after this list.)
2. Is it sufficient to change "architectures" to "GPNForSequenceClassification" in the config, or is that inferred automatically from the class used to load and save the model?
3. Could you kindly provide an example fine-tuning command for a binary classification task, including required arguments such as --problem_type and the expected dataset format? (My current guess is in the second sketch after this list.)
   - If possible, could you also briefly explain how finetune.py loads the model, config, and tokenizer and prepares the dataset?
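To make questions 1 and 2 concrete, here is what I am assuming the setup looks like. This is just the generic Hugging Face pattern, not verified against GPN: in particular, I am assuming that `GPNForSequenceClassification` exists and is registered with the Auto classes by `import gpn.model`, that `num_labels` can be passed at load time instead of hand-editing config.json, and that the head respects the standard `problem_type` field:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification
import gpn.model  # assumption: this also registers GPNForSequenceClassification

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path

# Pass num_labels at load time rather than editing config.json by hand.
config = AutoConfig.from_pretrained(
    checkpoint,
    num_labels=2,
    problem_type="single_label_classification",  # standard HF field; assuming GPN's head honors it
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

# If my understanding of transformers is right, "architectures" is rewritten
# automatically from the class name when the fine-tuned model is saved:
model.save_pretrained("gpn-finetuned-cls")
```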
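And for question 3 and the dataset question, below is my current guess at the full fine-tuning loop, written with the plain Hugging Face `Trainer` rather than finetune.py, since I do not know the script's actual arguments. All names here (checkpoint path, column names, sequence length, hyperparameters) are placeholders; I would love to know where finetune.py differs:

```python
from datasets import Dataset
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
import gpn.model  # assumption: registers GPN with the Auto classes

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint, num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

# Placeholder data: a "seq" column of DNA strings and a binary "label" column.
raw = Dataset.from_dict({
    "seq": ["ACGTACGT" * 64, "TTGCATGC" * 64],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(
        batch["seq"],
        truncation=True,
        max_length=512,  # placeholder; should not exceed max_position_embeddings
    )

dataset = raw.map(tokenize, batched=True, remove_columns=["seq"])

args = TrainingArguments(
    output_dir="gpn-finetuned-cls",
    per_device_train_batch_size=8,  # placeholder hyperparameters
    learning_rate=1e-5,
    num_train_epochs=3,
)

# With a tokenizer supplied, Trainer defaults to DataCollatorWithPadding,
# which also renames the "label" column to "labels" for the model's forward().
trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```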
Having an example would make it much easier to understand the full fine-tuning pipeline. Thank you again for your great work!