Question

gff file from NCBI RefSeq GCF dataset has an invalid format

2.4 years ago

Michael • 0

I found a GFF file from NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) that appears to be non-compliant: some lines have a start position greater than the end position. Here is an example line:

NC_007982.1     RefSeq  mRNA    691776  267232  .       ?       .       ID=rna-ZeamMp017;Parent=gene-ZeamMp017;Dbxref=GeneID:37545003;gbkey=mRNA;gene=nad1;locus_tag=ZeamMp017

Note that the 691776 in column 4 is greater than the 267232 in column 5. According to the GFF3 spec at https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md this is not allowed: start must be less than or equal to end. Consequently, the file cannot be loaded into my genome browser (JBrowse 2), which seems strict about formatting.
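To see how many rows are affected, a check along these lines may help. It is only a minimal sketch: the function name `find_inverted_features` is made up for illustration, the second feature row in the example is a hypothetical row added for contrast, and the script tests nothing beyond start > end.

```python
def find_inverted_features(gff_lines):
    """Yield (line_number, start, end) for feature rows where start > end."""
    for line_number, line in enumerate(gff_lines, start=1):
        # Skip blank lines, directives (##...) and comments (#...).
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete 9-column feature row
        try:
            start, end = int(columns[3]), int(columns[4])
        except ValueError:
            continue  # non-numeric coordinates are a different problem
        if start > end:
            yield line_number, start, end


example = [
    "##gff-version 3\n",
    # The row quoted in the question (attributes shortened).
    "NC_007982.1\tRefSeq\tmRNA\t691776\t267232\t.\t?\t.\tID=rna-ZeamMp017;Parent=gene-ZeamMp017\n",
    # Hypothetical well-formed row, for contrast only.
    "NC_007982.1\tRefSeq\tCDS\t266974\t267232\t.\t-\t0\tID=cds-example\n",
]
inverted = list(find_inverted_features(example))
print(inverted)
```

For a real file, the same function can be fed an open file handle, for example `find_inverted_features(open("genomic.gff"))`, where the file name is just a placeholder.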

The gff file came in the dataset for GCF_902167145.1 (Zea mays version 5).

My questions are:

Am I right that this is a mis-formatted gff file?
Has anyone seen this in other GFF files from RefSeq / NCBI Datasets? Is this a RefSeq-wide issue or just an issue with this particular maize dataset?

Yes, I know I can parse the file and remove or fix the defects with a little scripting. But I would think malformed GFF files should be fixed at NCBI Datasets.

Am I right that this is a mis-formatted gff file

yes

Reply by Pierre Lindenbaum, 2.4 years ago

score 3 · Accepted Answer · 2023-01-31

Thank you for noticing this. It is indeed an issue in the GFF3 file.

The root of the problem is that this gene is impossible to represent correctly in GFF3, because it incorporates sequence from both strands via trans_splicing. The complexity of the gene can be seen in the flatfile:

     gene            join(50490..50874,320928..322595,548714..548772,
                     complement(266974..267232))
                     /gene="nad1"
                     /locus_tag="ZeamMp186"
                     /trans_splicing
                     /db_xref="GeneID:4055939"
     CDS             join(50490..50874,320928..321010,322404..322595,
                     548714..548772,complement(266974..267232))
                     /gene="nad1"
                     /locus_tag="ZeamMp186"
                     /exception="RNA editing"
                     /trans_splicing

trans_splicing is a post-transcriptional process that combines parts of what start off as separate transcripts into a mature product. It can reorder exons, mix strands, and even combine exons from different genomic molecules, and it occurs in many plant organelles. The GFF3 spec does not cover it, so we do what we can. The issue here is that we recently added logic to generate virtual mRNA and exon features for organelles, to make them more compatible with various tools (previously only the CDS rows were present on the organelles), and it looks like we have a bug in setting the mRNA range.
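To make the difficulty concrete, the gene location from the flatfile excerpt can be split into its segments. This is only an illustrative sketch: `parse_location` is a made-up helper that handles the simple `join(...)` / `complement(a..b)` form shown above and nothing more (for example, not `complement(join(...))`), and it is not how NCBI generates the GFF3.

```python
import re


def parse_location(location):
    """Return (start, end, strand) tuples for a simple join(...) location."""
    segments = []
    # Match either complement(a..b) or a bare a..b span.
    pattern = re.compile(r"(complement\()?(\d+)\.\.(\d+)\)?")
    for match in pattern.finditer(location):
        strand = "-" if match.group(1) else "+"
        segments.append((int(match.group(2)), int(match.group(3)), strand))
    return segments


# Gene location copied from the flatfile excerpt.
gene_location = (
    "join(50490..50874,320928..322595,548714..548772,"
    "complement(266974..267232))"
)
segments = parse_location(gene_location)
strands = {strand for _, _, strand in segments}

for start, end, strand in segments:
    print(start, end, strand)
print("strands used:", sorted(strands))
```

The segments fall on both strands, and the last one in transcript order lies upstream of the two before it, so a single GFF3 row with one start, one end and one strand cannot describe the whole feature.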

As a workaround, you can try dropping just the mRNA rows for ZeamMp017, ZeamMp016 and ZeamMp019 from the GFF3 file. I am assuming that the GFF3 parser in JBrowse 2 can handle that and still load the file.
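Dropping those rows could be scripted roughly as follows. This is a sketch under assumptions: the helper names are invented for illustration, the rows are matched on the `locus_tag` attribute as it appears in the example line from the question, and the second row in the example data is hypothetical.

```python
LOCUS_TAGS_TO_DROP = {"ZeamMp017", "ZeamMp016", "ZeamMp019"}


def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return attributes


def drop_mrna_rows(gff_lines, locus_tags=LOCUS_TAGS_TO_DROP):
    """Yield every line except mRNA rows whose locus_tag is in locus_tags."""
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == "mRNA":
            if parse_attributes(columns[8]).get("locus_tag") in locus_tags:
                continue  # skip the affected mRNA row
        yield line


rows = [
    "##gff-version 3\n",
    # The row quoted in the question (attributes shortened).
    "NC_007982.1\tRefSeq\tmRNA\t691776\t267232\t.\t?\t.\tID=rna-ZeamMp017;locus_tag=ZeamMp017\n",
    # Hypothetical unaffected mRNA row, for contrast only.
    "NC_007982.1\tRefSeq\tmRNA\t100\t200\t.\t+\t.\tID=rna-hypothetical;locus_tag=hypothetical_tag\n",
]
kept = list(drop_mrna_rows(rows))
print("".join(kept), end="")
```

Child rows that name a dropped mRNA in their `Parent` attribute are left untouched here, so whether the result loads depends on the same assumption made above about how the parser treats them.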