Description
Greetings,
I am interested in computing accuracy metrics for a query GFF against a reference. The reference/query files have both CDS and exon features. I want to perform accuracy calculations using strict terminal boundaries, operating on CDS specifically.
I did some testing and made the following observations:
- Exon features are prioritized for accuracy metrics, but CDS can still be used. That is, removing exon rows changes the accuracy values when calculated from CDS+exon, but removing CDS rows does not.
- The -e parameter is documented as "max. distance (range) allowed from free ends of terminal exons of reference transcripts." Setting -e 0 for the CDS file changes only the exon-level accuracy metrics; the transcript- and locus-level metrics are unchanged. Sensitivity and precision unsurprisingly dip slightly with -e 0.
- The --strict-match option is mentioned in the documentation under the transcript-level description, but not in the parameter list: "Using --strict-match option can make the accuracy estimation at this level much more stringent by only allowing a limited variation of the outer coordinates of the terminal exons (by at most 100 bases by default, but this value can be changed with the -e option)." When I set --strict-match -e 0, the exon/intron-level metrics are the same as with -e 0 alone, but the intron-chain, transcript, and locus-level metrics all decrease.
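The row removal in the first observation can be sketched as follows. This is a minimal illustration, assuming tab-separated GFF rows with the feature type in column 3; the function name `drop_feature_rows` and the example rows are hypothetical and not part of gffcompare.

```python
def drop_feature_rows(gff_lines, feature_type):
    """Return the GFF lines with every row of the given feature type removed."""
    kept = []
    for line in gff_lines:
        # Keep comments, directives, and blank lines untouched.
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        columns = line.rstrip("\n").split("\t")
        # Column 3 (index 2) of a GFF row holds the feature type.
        if len(columns) >= 3 and columns[2] == feature_type:
            continue
        kept.append(line)
    return kept


# Made-up example rows, for illustration only.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1\n",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=tx1\n",
    "chr1\tsrc\tCDS\t150\t300\t.\t+\t0\tParent=tx1\n",
]
cds_only = drop_feature_rows(example, "exon")
exon_only = drop_feature_rows(example, "CDS")
```

Running gffcompare on the `cds_only` and `exon_only` variants of both files is one way to check which feature type drives the reported metrics.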
Given the observations above, I think --strict-match -e 0 is the correct way to stringently compare CDS. Do you agree, or do you have a different suggestion? The --strict-match parameter isn't clearly described in the documentation. Does "only allowing a limited variation of the outer coordinates of the terminal exons" mean that, when running gffcompare with defaults (--strict-match not specified), the terminal exon boundaries can differ by any amount so long as the intron chains match?
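To make the question concrete, here is a toy sketch of how I read the documentation. It is my interpretation only, not gffcompare's actual code; the function names are made up, the coordinates are invented, and it only covers multi-exon transcripts.

```python
def intron_chain(exons):
    """Introns implied by a sorted list of (start, end) exon intervals."""
    return [(exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)]


def transcripts_match(ref_exons, qry_exons, strict=False, max_end_dist=100):
    """Assumed matching rule: identical intron chains, and in strict mode the
    outer ends of the terminal exons must lie within max_end_dist bases."""
    ref = sorted(ref_exons)
    qry = sorted(qry_exons)
    if len(ref) < 2 or len(qry) < 2:
        raise ValueError("this sketch only handles multi-exon transcripts")
    if intron_chain(ref) != intron_chain(qry):
        return False
    if not strict:
        # Assumed default behaviour: outer ends are not checked at all.
        return True
    start_ok = abs(ref[0][0] - qry[0][0]) <= max_end_dist
    end_ok = abs(ref[-1][1] - qry[-1][1]) <= max_end_dist
    return start_ok and end_ok


ref = [(100, 300), (500, 900)]
qry = [(20, 300), (500, 1400)]  # same intron, very different outer ends

default_match = transcripts_match(ref, qry)
strict_default_e = transcripts_match(ref, qry, strict=True)
strict_e0_self = transcripts_match(ref, ref, strict=True, max_end_dist=0)
```

Under this reading, `default_match` is true even though the 3' end differs by 500 bases, `strict_default_e` is false, and only identical outer coordinates pass with `max_end_dist=0`. Is that how the default and --strict-match modes actually behave?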
As an aside, I am not sure why one would want to calculate accuracy using -e 0 alone, which allows some fuzziness in the parent-level features but is strict at exon/intron level. Have you observed any specific use cases for this?
Thanks,
Eric