What, If Any, Are The Conventions For Encoding Extra Information In The GFF3 Ninth Column?
12.2 years ago
Michael Barton ★ 1.8k

The GFF3 format is well specified for describing sequence location and type in the first eight columns. The ninth column is left for specifying any remaining information. I would like to use GFF3 to encode the data typically produced by a genome annotator.

How should I go about this? Are there any conventions for encoding information such as protein product, EC number, and description?

gff annotation • 2.6k views
12.2 years ago
Scott Cain ▴ 770

There are no conventions outside of the GFF3 spec, though I would note that the reserved tag "Note" has historically been used for descriptions in GBrowse, so if you're going to be using GBrowse, it makes sense to put descriptions there. Generally I suggest that you encode the information in such a way that it will make it easier to use in whatever your downstream application for it is. Of course, you may not always know what that is, but that's my best advice.

Thanks for the suggestion. Are there any common variants of GFF which include this type of information?

You mean that have protein information? Not that I can recall, though I can tell you that the GFF3 at NCBI has EC_number tags that violate the GFF3 spec (at least, they did the last time I checked). What is the end use that you have in mind?

I want to encode additional genome annotation data in GFF3: things like product and description.

Yes, but I meant, why do you want to do that? Who is it for, and what will they be using the GFF file for?

12.2 years ago
Neilfws 49k

You'll see from the spec that there are a few conventions for column 9: key-value pairs, "reserved" keys, characters that require escapes and some conventions regarding case.
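
To make those conventions concrete, here is a minimal Python sketch of how a column 9 string could be built. It assumes the rules as I read them in the spec: semicolon-separated tag=value pairs, commas between multiple values, and percent-escaping of the characters that have a special meaning in the column. The function names and the lowercase tags "product" and "ec_number" are made up for illustration and are not defined by the spec.

```python
# Characters with a reserved meaning in column 9, plus "%" itself.
RESERVED = set(";=&,%")


def escape_value(text):
    """Percent-escape reserved and control characters in a tag or value."""
    out = []
    for ch in text:
        if ch in RESERVED or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def format_attributes(attrs):
    """Build a column 9 string from a dict of tag -> value or list of values."""
    parts = []
    for key, values in attrs.items():
        if isinstance(values, str):
            values = [values]
        joined = ",".join(escape_value(v) for v in values)
        parts.append(escape_value(key) + "=" + joined)
    return ";".join(parts)


# Hypothetical annotation record; "product" and "ec_number" are
# illustrative lowercase (application-specific) tags, not reserved ones.
column9 = format_attributes({
    "ID": "gene0001",
    "Note": "hypothetical protein; putative",
    "product": "ATP synthase, subunit A",
    "ec_number": "3.6.3.14",
})
print(column9)
```

Keeping your own tags lowercase follows the case convention in the spec, which sets aside tags starting with an uppercase letter for reserved use.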

Apart from that, the key-value pairs are whatever you wish. This is in large part because GFF3 files are used by GBrowse. Track display in GBrowse often relies on small Perl subroutines in the config file, which read customised attributes from column 9.
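
Reading custom attributes back out is equally simple, which is partly why ad hoc keys are so common. The following is only a sketch in Python (not the Perl that GBrowse uses), assuming well-formed tag=value pairs; parse_attributes is a hypothetical helper name and the sample line is invented.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a column 9 string into a dict of tag -> list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, sep, value = field.partition("=")
        if not sep:
            continue  # skip malformed fields that lack "="
        # Split on commas first, then undo percent-escaping, so an
        # escaped comma (%2C) stays inside its value.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


# Invented example line for illustration only.
sample = "ID=gene0001;Note=hypothetical protein%3B putative;Dbxref=EC:3.6.3.14,GO:0005524"
parsed = parse_attributes(sample)
print(parsed["Note"][0])
print(parsed["Dbxref"])
```

Every value comes back as a list, since any tag may carry several comma-separated values.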

You might want to look at the Bioperl script bp_genbank2gff3.pl, which goes some way towards "standardising" the conversion of GenBank format to GFF3.
