Question

gff3_sp_complement_annotations.pl parameter settings


4.2 years ago

Ric ▴ 430

I ran gff3_sp_complement_annotations.pl --ref stringtie.gff3 --add augustus.hints_utr.gff3 --out gaas.gff3 and got the following new IDs:

NbV1Ch01        transdecoder    five_prime_UTR  99705   99705   .       -       .       ID=nbis_NEW-five_prime_utr-4254;Parent=STRG2.t1
NbV1Ch01        transdecoder    three_prime_UTR 98185   98185   .       -       .       ID=nbis_NEW-three_prime_utr-2120;Parent=STRG2.t1
...
NbV1Ch01        AUGUSTUS        start_codon     112448  112450  .       -       0       ID=start_codon-70639;Parent=g65212.t1
NbV1Ch01        AUGUSTUS        stop_codon      109839  109841  .       -       0       ID=stop_codon-70662;Parent=g65212.t1

Additionally, I got the following message:

**********************************************************************************************************************************************************
*                            Primary tag values (3rd column) not expected => transcription_start_site transcription_end_site                             *
*                                                   Those primary tag are not yet taken into account !                                                   *
*     If you wish to use it/them, pleast update the parameter feature json files accordingly (features_level1, features_level2 or features_level3).      *
*                                                                       To resume:                                                                       *
*                                                   - it must be a level1 feature if it has no parent.                                                   *
*                                    - it must be a level2 feature if it has a parent and this parent is from level1.                                    *
*                                  - it must be a level3 feature if it has a parent and this parent has also a parent.                                   *
*                                                                                                                                                        *
*                Currently the tool just ignore them, So if they where Level1,level2, a gene or RNA feature will be created accordingly.                 *
**********************************************************************************************************************************************************
**********************************************************************************************************************************************************
* Primary tag values (3rd column) not expected => transcription_start_site transcription_start_site transcription_start_site transcription_start_site tr *
*                      Those values are not compatible with gff3 format and the tool cannot guess to which one they correspond to.                       *
*                                     If you want to follow rigourously the gff3 format, please visit this website:                                      *
*                                      https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md                                       *
*                                                      They provide tools to check the gff3 format.                                                      *
*                            Even if you have this warning, you should be able to use the gff3 output in most of gff3 tools.                             *
**********************************************************************************************************************************************************

How does GAAS find the new start_codon and stop_codon?
What are the advantages or disadvantages of enabling transcription_start_site and transcription_end_site?
How can they be enabled?

Thank you in advance,

UPDATE

I ran agat_sp_complement_annotations.pl --ref transdecoder.genome.Fix.gff3 --add augustus.hints_utr.gff3 --out augustus.hints_utrAGAT.gff3. This is a screenshot of an area where I would like to keep only NBlab03G03860.1 and not NBlab03G03870.1, NBlab03G03880.1, and NBlab03G03890.1.


Please find here the genes in question as GFF3.

How can I remove the three small ones?

Thank you in advance

gene annotation • 1.5k views



Please use agat_sp_complement_annotations.pl from AGAT instead. It contains some improvements.

ADD REPLY • link 4.2 years ago by Juke34 8.5k


Thank you, I used it.

ADD REPLY • link 4.2 years ago by Ric ▴ 430


Thank you. I noticed that agat_sp_complement_annotations.pl did not remove overlapping genes, as shown in my updated question at the top. What did I miss?

Thank you in advance

ADD REPLY • link 4.2 years ago by Ric ▴ 430


They do not overlap. To be considered overlapping, genes must overlap in their CDS parts.

ADD REPLY • link 4.2 years ago by Juke34 8.5k


Thank you. By any chance, do you have a summary of the rules that determine whether genes are considered or not?

ADD REPLY • link 4.2 years ago by Ric ▴ 430


I already answered this question here: "A: intersecting two GFF3 missing data from the second file"

ADD REPLY • link 4.2 years ago by Juke34 8.5k

score 0 · Answer 1 · 2020-01-31

Basically, transcription_start_site is not an officially accepted term in the GFF3 specification. The term to use here is TSS. You could replace the term with a sed command. transcription_end_site is a correct term, but I forgot to add it to the tool (it will be fixed in the GitHub repo soon). You can easily fix that yourself.
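The renaming mentioned above can be sketched as follows. This is only an illustration, not part of AGAT: it uses awk rather than sed so that the replacement is restricted to the third column, the function name rename_tss is hypothetical, and the file names in the usage comment are placeholders.

```shell
# rename_tss (hypothetical helper): rewrite the feature type in column 3
# from transcription_start_site to TSS, leaving comment lines untouched.
rename_tss() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ && $3 == "transcription_start_site" { $3 = "TSS" }
       { print }' "$@"
}

# Usage (placeholder file names):
# rename_tss augustus.hints_utr.gff3 > augustus.hints_utr.tss.gff3
```

Matching on the whole third column avoids touching attribute values in column 9 that happen to contain the same string.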

from SOFA:

[Term]
id: SO:0000315
name: TSS
namespace: sequence
def: "The first base where RNA polymerase begins to synthesize the RNA transcript." [SO:ke]
subset: SOFA
synonym: "transcription start site" EXACT []
synonym: "transcription_start_site" EXACT []
is_a: SO:0000835 ! primary_transcript_region

This information is basically discarded during file parsing.

To enable it:

agat_sp_gxf_to_gff3.pl --expose
nano features_level3.json # or use the text editor of your choice

and add these lines to the JSON file (/!\ if one of them ends up as the last entry in the file, it must not end with a comma):

"transcription_start_site":"exon",
"transcription_end_site":"exon",

Then you have to run agat_sp_complement_annotations.pl from the same directory, where the edited JSON file is located.
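Since a stray trailing comma makes the JSON file invalid, it may be worth checking the edited file before running the tool. A minimal sketch, assuming python3 is installed (its standard json.tool module does the parsing); the helper name check_json is hypothetical and not part of AGAT:

```shell
# check_json (hypothetical helper): succeeds only if the file parses as JSON.
# A trailing comma after the last entry makes the parse fail.
check_json() {
  python3 -m json.tool "$1" > /dev/null
}

# Usage, from the directory where --expose wrote the file:
# check_json features_level3.json && echo "features_level3.json is valid JSON"
```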

About the new IDs:

An ID such as nbis_NEW-three_prime_utr means either that the feature did not exist (was missing) before and was created during parsing, or that the original ID was not unique and the parser fixed it. For more information, run:
agat_sp_gxf_to_gff3.pl --gff file.gff
and inspect the warnings in the output.
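
To get a quick overview of how many features received such IDs, something like the following could be used. It is a sketch only: the nbis_NEW- prefix is taken from the output shown in the question, the function name count_new_ids is hypothetical, and the file name in the usage comment is the one from the question.

```shell
# count_new_ids (hypothetical helper): count features whose ID attribute
# starts with nbis_NEW-, grouped by feature type (column 3).
count_new_ids() {
  awk -F '\t' '!/^#/ && $9 ~ /(^|;)ID=nbis_NEW-/ { n[$3]++ }
               END { for (t in n) print t "\t" n[t] }' "$@" | sort
}

# Usage:
# count_new_ids gaas.gff3
```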