Description
Hi, thank you for the excellent work on GPN — it's a really well-structured and efficient framework for genomic sequence modeling. I used it to pretrain a model on my own dataset with the GPNForMaskedLM architecture.
After training, my config.json looks like this:
```json
{
  "architectures": ["GPNForMaskedLM"],
  "model_type": "GPN",
  "vocab_size": 7,
  "embedding": "one_hot",
  "embedding_size": 768,
  "encoder": "convnet",
  "num_hidden_layers": 25,
  "hidden_size": 512,
  "pad_token_id": 0,
  "max_position_embeddings": 1536,
  ...
}
```
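For context, here is how I am currently loading this checkpoint. This is only my own sketch following the README pattern of importing `gpn.model` so the custom `model_type` is registered with the `transformers` Auto classes; the path is a placeholder for my local run:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import gpn.model  # registers the GPN model type with the transformers Auto classes

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path to my pretraining run

# Assumption: the tokenizer files were saved alongside the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
```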
My questions are:
1. Should I manually add "num_labels": 2 to config.json before fine-tuning, or can it be passed at load time? (My current assumption is in the first sketch after this list.)
2. Is it sufficient to change "architectures" to "GPNForSequenceClassification" in the config, or is that inferred automatically from the class used to load and save the model?
3. Could you kindly provide an example fine-tuning command for a binary classification task, including required arguments such as --problem_type and the expected dataset format? (My current guess is in the second sketch after this list.)
   - If possible, could you also briefly explain how finetune.py loads the model, config, and tokenizer and prepares the dataset?
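To make questions 1 and 2 concrete, here is what I am assuming the setup looks like. This is just the generic Hugging Face pattern, not verified against GPN: in particular, I am assuming that `GPNForSequenceClassification` exists and is registered with the Auto classes by `import gpn.model`, that `num_labels` can be passed at load time instead of hand-editing config.json, and that the head respects the standard `problem_type` field:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification
import gpn.model  # assumption: this also registers GPNForSequenceClassification

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path

# Pass num_labels at load time rather than editing config.json by hand.
config = AutoConfig.from_pretrained(
    checkpoint,
    num_labels=2,
    problem_type="single_label_classification",  # standard HF field; assuming GPN's head honors it
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

# If my understanding of transformers is right, "architectures" is rewritten
# automatically from the class name when the fine-tuned model is saved:
model.save_pretrained("gpn-finetuned-cls")
```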
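And for question 3 and the dataset question, below is my current guess at the full fine-tuning loop, written with the plain Hugging Face `Trainer` rather than finetune.py, since I do not know the script's actual arguments. All names here (checkpoint path, column names, sequence length, hyperparameters) are placeholders; I would love to know where finetune.py differs:

```python
from datasets import Dataset
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
import gpn.model  # assumption: registers GPN with the Auto classes

checkpoint = "path/to/my_pretrained_gpn"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint, num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)

# Placeholder data: a "seq" column of DNA strings and a binary "label" column.
raw = Dataset.from_dict({
    "seq": ["ACGTACGT" * 64, "TTGCATGC" * 64],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(
        batch["seq"],
        truncation=True,
        max_length=512,  # placeholder; should not exceed max_position_embeddings
    )

dataset = raw.map(tokenize, batched=True, remove_columns=["seq"])

args = TrainingArguments(
    output_dir="gpn-finetuned-cls",
    per_device_train_batch_size=8,  # placeholder hyperparameters
    learning_rate=1e-5,
    num_train_epochs=3,
)

# With a tokenizer supplied, Trainer defaults to DataCollatorWithPadding,
# which also renames the "label" column to "labels" for the model's forward().
trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```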
Having an example would make it much easier to understand the full fine-tuning pipeline. Thank you again for your great work!