Question

Standardizing 3rd party GFF3 file

1

Entering edit mode

4.2 years ago

Anand Rao ▴ 640

I have a problem of trying to use a GFF3 file from a community resource - there are certainly elements that do not look properly formatted, and other cases I am not too sure if they are OK or not.

Rather than guessing whether tweaks will be compatible with my downstream bioinformatics steps, I seek suggestions regarding which lines to modify and in what manner. I have some ideas I seek your feedback on, but first some context below:

Download file -

wget https://medicago.toulouse.inra.fr/MtrunA17r5.0-ANR/downloads/1.7/MtrunA17r5.0-ANR-EGN-r1.7.structuralAnnotation.zip

Uncompress -

unzip MtrunA17r5.0-ANR-EGN-r1.7.structuralAnnotation.zip

Remove commented lines -

grep -v "^#" MtrunA17r5.0-ANR-EGN-r1.7.gff3 > MtrunA17r5.0-ANR-EGN-r1.7_NOcomments.gff3 
wc -l MtrunA17r5.0-ANR-EGN-r1.7.gff3
  936120 MtrunA17r5.0-ANR-EGN-r1.7.gff3
wc -l MtrunA17r5.0-ANR-EGN-r1.7_NOcomments.gff3
  936073 MtrunA17r5.0-ANR-EGN-r1.7_NOcomments

Count number of types of features/methods in GFF3 file -

awk {'print $2'} MtrunA17r5.0-ANR-EGN-r1.7_NOcomments.gff3 | sort -V | uniq -c
220428 BLASTN
 584 BioFileConverter
535992 EuGene
 879 HelitronScanner
23075 LTRdigest
50990 LTRharvest
94880 TIRvish
  80 Tephra
3790 ope_rescue
5375 smallA

One line each for each from this GFF3, for each of these features / methods-

MtrunA17CP  BLASTN  repeat_region   63  208 0   -   .   ID=DTT_singleton_family894_terminal_inverted_repeat_element12995_MtrunA17Chr7_27512288_27521410_fragment_MtrunA17CP-1;Name=MtrunA17CPR0000010;locus_tag=MtrunA17_CPR0000010

MtrunA17Chr1    BioFileConverter    gene    27582277    27586500    .   -   .   ID=1ae4e465-f44f-4ed1-be05-6d36bee93e92;old_name=Medtr1g054635.1;Name=MtrunA17Chr1g0176054;product=Putative oxidoreductase;date_creation=2019-03-29;date_last_modified=2019-12-09;owner=pascal.gamas@inra.fr;status=Validated;locus_tag=MtrunA17_Chr1g0176054

MtrunA17CP  EuGene  gene    1   977 .   +   .   ID=gene:MtrunA17CPg0492171;Name=MtrunA17CPg0492171;locus_tag=MtrunA17_CPg0492171

MtrunA17Chr0c01 HelitronScanner repeat_region   558455  569669  .   -   .   ID=helitron1;family=DHH_singleton_family0;Ontology_term=SO:0000544;Name=MtrunA17Chr0c01R0004190;locus_tag=MtrunA17_Chr0c01R0004190;biotype=helitron

MtrunA17Chr0c01 LTRdigest   RR_tract    5825    5835    .   -   .   ID=0af6b1d2af09ab290c93ef7592c9495e;Parent=TRIM_retrotransposon1

MtrunA17Chr0c01 LTRharvest  repeat_region   5328    7072    .   -   .   ID=repeat_region1;Name=MtrunA17Chr0c01R0000090;locus_tag=MtrunA17_Chr0c01R0000090

MtrunA17CP  TIRvish repeat_region   55606   55851   .   .   .   ID=repeat_region16058;Name=MtrunA17CPR0000190;locus_tag=MtrunA17_CPR000019

MtrunA17Chr1    Tephra  repeat_region   2778995 2781510 .   -   .   ID=non_LTR_retrotransposon0;family=RIL_singleton_family0;Ontology_term=SO:0000189;Name=MtrunA17Chr1R0011840;locus_tag=MtrunA17_Chr1R0011840;biotype=non_LTR_retrotransposon

MtrunA17Chr0c01 ope_rescue  gene    3575    7626    .   -   .   ID=gene:MtrunA17Chr0c01g1039196;Name=MtrunA17Chr0c01g1039196;locus_tag=MtrunA17_Chr0c01g1039196

MtrunA17CP  smallA  gene    21763   21870   .   -   .   ID=gene:MtrunA17CPg0492352;Name=MtrunA17CPg0492352;Note="seed:204DVAAXX:1:217:367:218 free_energy:-64.00 sstruct:(((((((((.(((((...((((((..((((..((..((((((..((((....))))...))))))..))..))))))))))...))))).)))........)))))).";locus_tag=MtrunA17_CPg0492352

Problems and My Proposed Solutions seeking your feedback -

1. Column 3 uses repeat_region for TEPHRA, LTRharvest and TIRvish, but RR_tract for LTRdigest. So homogenize all predictions of repeats as repeat_region in column 3 / ?

2. When (column2==BioFileConverter), final column's ID=garbled. Can I just replace ID definition with Name definition ?

3. When (column2== ope_rescue || smallA), should I strip away 'gene:' from ID= definition? Or should I add this string, for any gene definition line, if it is missing ?

4. When (column2==LTRdigest), again ID=garbled. I am not sure if this can be salvaged, but if it needs instead to be given user based new definitions, like ID=LTRdigest$n - where $n is running integer based on occurrence # ?

5. In the example lines here, you can see the LTRdigest coords of 5825-5835 overlap with the LTRharvest coords of 5328-7072. Naive question - this should not be a problem IMO, am I right?

6. Any other problems you see in my example GFF3 lines, or in the full file (hence the link) that I have missed ?

Validation worked! Please note that validation using genometools' gt gff3validator worked on this file, despite all these issues I detect (and which have indeed raised run errors with tools such as featureCounts)

This is why I need your help: As I am sure you appreciate, modifying each perceived problem in a certain way could become a combinatorial rabbit hole and I do not want to chase my own tail in that warren. So any general guidelines, as well as specific response to my questions above would be greatly appreciated. Thanks, and be safe!

GFF3 format validation • 1.4k views

ADD COMMENT • link 4.2 years ago by Anand Rao ▴ 640

2

Entering edit mode

Are those logical solutions that you are proposing for validation or just workarounds? Now you know why they say that a $1K genome needs $100K for annotation.

ADD REPLY • link 4.2 years ago by GenoMax 146k

0

Entering edit mode

Probably workarounds, because real solutions would in the strict sense ideally need to use only controlled vocabulary approved by SequenceOntology for GFF3 etc., and other such restrictions, right?

note to self: (SO link, for GFF3, paper, websearchPage)?

BTW, repeating an important point in my OP, the file as it is validates OK, so that gt gff3validator step did not contribute much value to this exercise.

On a related point, I did try using AGAT, specifically agat_sp_statistics.pl --gff MtrunA17r5.0-ANR-EGN-r1.7_NOcomments.gff3 -o MtrunA17r5.0-ANR-EGN-r1.7_NOcomments_AGAT_sp_statistics.txt -v 0 - it spewed a lot of lines despite lowest STDOUT verbosity, but it just hanging without producing the final output statistics file (probably an install problem? not sure...)

I've sought help from the annotation group directly as well, and they've been quite helpful with related matters thus far, so I am quite hopeful they'll help resolve this (in fact I've shared this link with them)

I've also tagged the AGAT author on this post. Hopefully across all these players I'll have resolution in terms of a validated GFF3 file that I can then use for mapping, read count generation etc. Thanks for your comments, keep them coming! :)

ADD REPLY • link 4.2 years ago by Anand Rao ▴ 640

0

Entering edit mode

Which AGAT version do you use? I would suggest to install the most up-to-date in the master branch. Your file sounds quite complex. You should run agat_convert_sp_gxf2gxf.pl first to point the potential problem. I think you will have to "drive" AGAT to behave properly.
I would probably split your file based one sources (2nd column value) using awk and then try to fix independently each type of sources with agat_convert_sp_gxf2gxf.pl.

ADD REPLY • link 4.2 years ago by Juke34 8.8k

0

Entering edit mode

Thanks for replying. I am using AGAT Version: v0.4.0 (which seems to be your latest version from 4 Jun, 2020)

The annotation group updated their GTF, and with that new file, I followed your advice and split it based on source (2nd column). On each of 4 sources, I ran

agat_convert_sp_gxf2gxf.pl -g Assembly_Source$n.gtf -v 1 -o Assembly_Source$n_AGAT.gtf

where $n = annotation source #

Then I concatenated these individual AGAT outputs, with a simple

cat Assembly_Source*_AGAT.gtf > Assembly_Concat_AGAT.gtf

Now I need to sort this concatenated GTF file, and I wonder if I can use AGAT again for this? I've asked you a question about that at an earlier post by you at this link. Look fwd to your response. Thanks, in advance.

ADD REPLY • link 4.2 years ago by Anand Rao ▴ 640