What Is The Correct Specification For GFF3?
9.7 years ago

I am retrieving some GFF3 files for Arabidopsis from this FTP site.

The issue is that a conversion script I use to turn these into another format is getting stuck because some lines have a trailing semi-colon and others do not. For example, here are two lines that show the contrast:

Chr1    TAIR9    five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1
Chr1    TAIR9    CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;

I can do the following to strip the semi-colon, no big deal:

$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | ./gff2foo

But I can also "fix" this long-term by editing the conversion script, and I want to do that if the specification says a trailing semi-colon is "legal". I also don't want to introduce hacky fixes if this file is bogus.

What is the correct format for GFF3? Are trailing semi-colons allowed or are these broken GFF3 files?

9.7 years ago

Here is the detailed GFF3 specification: http://www.sequenceontology.org/gff3.shtml

It basically says that the tag=value pairs in the 9th column should be separated by semi-colons, which implies no trailing semi-colon.

But as you may have found, not everyone follows it strictly.
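If you do decide to make the conversion script tolerant, here is a minimal sketch of a 9th-column parser that accepts both forms. The function name `parse_gff3_attributes` and the `strict` flag are hypothetical, not taken from any existing tool, and the sketch assumes tab-separated columns and does not decode URL-escaped characters.

```python
def parse_gff3_attributes(column9, strict=False):
    """Parse the 9th GFF3 column into a dict mapping tag -> list of values.

    Hypothetical helper: with strict=False, empty fields (such as the one
    left by a trailing semi-colon) are skipped; with strict=True they are
    reported as errors.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            # An empty field comes from a trailing or doubled semi-colon.
            if strict:
                raise ValueError("empty attribute field in %r" % column9)
            continue
        tag, sep, value = field.partition("=")
        if not sep:
            raise ValueError("attribute without '=': %r" % field)
        # Multiple values for one tag are separated by commas.
        attributes[tag] = value.split(",")
    return attributes


# The CDS line from the question, assuming tab-separated columns.
line = "Chr1\tTAIR9\tCDS\t3760\t3913\t.\t+\t0\tParent=AT1G01010.1,AT1G01010.1-Protein;"
column9 = line.split("\t")[8]
print(parse_gff3_attributes(column9))
# → {'Parent': ['AT1G01010.1', 'AT1G01010.1-Protein']}
```

Passing `strict=True` makes the same line raise a `ValueError`, which is closer to what a validator would do.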

Here is a GFF3 validator: http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

Thanks. The spec is ambiguous but suggests that a key-value pair is needed if there is a semi-colon. Your validator was useful. It seems to think the input file is illegal for the same reason (among others), so I think the conversion script is in line with the specification.


