Description
TLDR: If you cannot produce any fusions, it may be an issue with your GTF file.
The GTF file:
a) cannot use _
in chromosome names
b) in column 9, must have a gene_name
field
c) in column 9, must have either gene_type
or gene_biotype
field
If you are not getting any fusions and you believe your GTF file abides by a) and b)....I believe you can make the LongGF more permissive to possibly handle c if you specify the pseudogene argument (set to something that is NOT 0 or 1). This will cause LongGF to use more GTF entries even if they are a pseudogene.
[pseudogene:0(default)/1]: Optional. Default=0: Not use pseudogene from the GTF file.
[Secondary_alignment:0(default)]: Optional. Default=0: not use secondary alignment.
My workaround was different as detailed below.
Background
I was testing using LongGF using a human NCBI reference and GTF from https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/ GRCh38.p14 (downloaded the "Genome sequences" ie GCF_000001405.40_GRCh38.p14_genomic.fna
and "Annotation features" ie genomic.gtf). I noticed that no fusions were detected. Since LongGF does not tolerate _
in chromosome names, I had changed my sorted bam and gtf file to use the "chr" chromosome naming convention instead of of the RefSeq naming convention ie NC_
. However, I still could not produce any fusions.
Issue
From my investigation, when LongGF processes the GTF file it creates a _sub_list
for each gene depending on the gene_type
which is abstracted from the gtf file. The gene_type
is determined by the fields gene_type
or gene_biotype
in the gtf file. A _sub_list
is only created if the gene_type
is one of:
const char * gt_list2[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain"};
from get_gfFrombam.c
.
Since the gtf file I was using lacks gene_type
and gene_biotype
, the _sub_list
never gets created and is normally necessary for determining possible gene fusions candidates. _sub_list is iterated through in _gtf_struct_.c
-->_gene_entry_
::get_coding_ovlp
to determine _t_code_reg_ovlp
in get_gfFrombam.c
.
This is why I did not produce any fusions.
Additionally, the gtf file I was using lacked the gene_name
field.
Resolution
The gtf file does contain transcript_biotype
and gbkey
fields which seem that they could be used in a similar fashion to gene_type
or gene_biotype
. Note - I modified some of the code in a branch and can create a pull request if easier.
Possible values are:
transcript_biotype "miRNA"
transcript_biotype "mRNA"
transcript_biotype "ncRNA"
transcript_biotype "rRNA"
transcript_biotype "scRNA"
transcript_biotype "snoRNA"
transcript_biotype "snRNA"
transcript_biotype "transcript"
transcript_biotype "tRNA"
gbkey "CDS"
gbkey "Gene"
gbkey "mRNA"
gbkey "ncRNA"
gbkey "rRNA"
gbkey "tRNA"
I added the following in _gtf_struc.c
while (std::getline(oss, cur_g_str, ' ')){
cur_g_str = str_tolower(cur_g_str); // revise to tolerate lowcase and uppercase
if (cur_g_str.compare("gene_id")==0){
std::getline(oss, cur_g_id, ' ');
cur_g_id = cur_g_id.substr(1, cur_g_id.size()-3);
}else if (cur_g_str.compare("transcript_id")==0){
std::getline(oss, cur_t_id, ' ');
cur_t_id = cur_t_id.substr(1, cur_t_id.size()-3);
// MODIFICATION added gene field since new gtf replaced gene_name with gene
}else if (cur_g_str.compare("gene_name")==0 || cur_g_str.compare("gene")==0){
std::getline(oss, cur_g_name, ' ');
cur_g_name = cur_g_name.substr(1, cur_g_name.size()-3);
// MODIFICATION: added || cur_g_str.compare("gbkey")==0 || cur_g_str.compare("transcript_biotype"==0
// MODIFICATION made since the new gtf from NCBI does not have gene_type or gene_biotype so cur_g_type = '' which causes is_g to be false and _sub_list never gets created from the initial gene entry so each gene does not have a _sub_list in m_gene_list. _sub_list is iterated through in _gtf_struct_.c-->_gene_entry_::get_coding_length to determine _t_code_reg_ovlp in get_gfFrombam.c. Since isg_i is false due to lack of gene_type/gene_biotype in the gtf, _sub_list never gets created for each gene and therefore _t_code_reg_ovlp is 0. If _t_code_reg_ovlp is 0 for a given gene entry there is no potential fusion detected and no fusions are detected. Also created gt_list3 in get_gfFrombam.c.
}else if (cur_g_str.compare("gene_type")==0 || cur_g_str.compare("gene_biotype")==0 || cur_g_str.compare("gbkey")==0 || cur_g_str.compare("transcript_biotype")==0){// add gene_biotype to consider other types
std::getline(oss, cur_g_type, ' ');
cur_g_type = cur_g_type.substr(1, cur_g_type.size()-3);
and in get_gfFrombam.c
:
int m_check_gene_fusion(const char* in_bam_file, const char* in_gtf_file, const int min_len_ovlp, const int64_t _bin_size, const int _used_pseudogene, const int64_t _min_map_len, const int _used_secondary_alignment, const int _min_sup_read, const int output_flag){
std::map<std::string, std::map<std::string, std::shared_ptr<_gtf_entry_> > > m_gene_list;
std::map<std::string, std::string> _gid_to_gn;
const char * gt_list1[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","IG_pseudogene","IG_C_pseudogene","IG_J_pseudogene","IG_V_pseudogene","TR_V_pseudogene","TR_J_pseudogene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","pseudogene","processed_pseudogene","polymorphic_pseudogene","transcribed_processed_pseudogene","transcribed_unprocessed_pseudogene","transcribed_unitary_pseudogene","translated_processed_pseudogene","translated_unprocessed_pseudogene","unitary_pseudogene","unprocessed_pseudogene","disrupted_domain"};
int gt_size1 = 30;
const char * gt_list2[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain"};
int gt_size2 = 14;
//MODIFICATION commented the below
//int gt_size = gt_size2;
//MODIFICATION added gt_list3 lines below
const char * gt_list3[100] = {"IG_C_gene","IG_D_gene","IG_J_gene","IG_LV_gene","IG_V_gene","TR_C_gene","TR_J_gene","TR_V_gene","TR_D_gene","nonsense_mediated_decay","non_stop_decay","protein_coding","ambiguous_orf","disrupted_domain", "mRNA", "transcript", "CDS", "Gene"};
int gt_size3 = 18;
int gt_size = gt_size3;
const char ** gt_list;
if (_used_pseudogene==1){
gt_size = gt_size1;
gt_list = gt_list1;
}else if (_used_pseudogene==0){
gt_size = gt_size2;
gt_list = gt_list2;
//MODIFICATION added the chunk below
}else if (_used_pseudogene==2){
gt_size = gt_size3;
gt_list = gt_list3;
}else{// revise to consider all without filter.
gt_size = 0;
gt_list = NULL;
}
int retv = read_gtf(m_gene_list, _gid_to_gn, in_gtf_file, gt_list, gt_size);
//MODIFICATION changed the default value of _used_pseudogene from 0 to 2
int _used_pseudogene = 2;