gff3_sp_complement_annotations.pl parameter settings
22 months ago
Ric ▴ 370

I ran gff3_sp_complement_annotations.pl --ref stringtie.gff3 --add augustus.hints_utr.gff3 --out gaas.gff3 and got the following new IDs:

NbV1Ch01        transdecoder    five_prime_UTR  99705   99705   .       -       .       ID=nbis_NEW-five_prime_utr-4254;Parent=STRG2.t1
NbV1Ch01        transdecoder    three_prime_UTR 98185   98185   .       -       .       ID=nbis_NEW-three_prime_utr-2120;Parent=STRG2.t1
...
NbV1Ch01        AUGUSTUS        start_codon     112448  112450  .       -       0       ID=start_codon-70639;Parent=g65212.t1
NbV1Ch01        AUGUSTUS        stop_codon      109839  109841  .       -       0       ID=stop_codon-70662;Parent=g65212.t1


Additionally, I got the following message:

**********************************************************************************************************************************************************
*                            Primary tag values (3rd column) not expected => transcription_start_site transcription_end_site                             *
*                                                   Those primary tag are not yet taken into account !                                                   *
*     If you wish to use it/them, pleast update the parameter feature json files accordingly (features_level1, features_level2 or features_level3).      *
*                                                                       To resume:                                                                       *
*                                                   - it must be a level1 feature if it has no parent.                                                   *
*                                    - it must be a level2 feature if it has a parent and this parent is from level1.                                    *
*                                  - it must be a level3 feature if it has a parent and this parent has also a parent.                                   *
*                                                                                                                                                        *
*                Currently the tool just ignore them, So if they where Level1,level2, a gene or RNA feature will be created accordingly.                 *
**********************************************************************************************************************************************************
**********************************************************************************************************************************************************
* Primary tag values (3rd column) not expected => transcription_start_site transcription_start_site transcription_start_site transcription_start_site tr *
*                      Those values are not compatible with gff3 format and the tool cannot guess to which one they correspond to.                       *
*                                     If you want to follow rigourously the gff3 format, please visit this website:                                      *
*                                      https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md                                       *
*                                                      They provide tools to check the gff3 format.                                                      *
*                            Even if you have this warning, you should be able to use the gff3 output in most of gff3 tools.                             *
**********************************************************************************************************************************************************

1. How does GAAS determine the new start_codon and stop_codon features?
2. What are the advantages or disadvantages of enabling transcription_start_site and transcription_end_site?
3. How can they be enabled?

UPDATE

I ran agat_sp_complement_annotations.pl --ref transdecoder.genome.Fix.gff3 --add augustus.hints_utr.gff3 --out augustus.hints_utrAGAT.gff3. This is the screenshot of an area where I would like to keep only NBlab03G03860.1 and not NBlab03G03870.1, NBlab03G03880.1 and NBlab03G03890.1.

Please find here the genes in question as GFF3.

How can I remove the three small ones?


Please use agat_sp_complement_annotations.pl from AGAT instead. It contains some improvements.


Thank you, I used it.


Thank you. I noticed that agat_sp_complement_annotations.pl did not remove overlapping genes, as shown in the update at the top of my question. What did I miss?


They do not overlap. To be considered overlapping, genes must overlap in their CDS parts.
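To check this rule by hand on a pair of transcripts, a minimal sketch follows. It is not AGAT's implementation: the helper name `cds_overlap` is hypothetical, it assumes each CDS line carries a single `Parent=` value, and it ignores strand. It only tests whether any CDS interval of one transcript shares a base with any CDS interval of another on the same sequence.

```shell
# Hypothetical helper (not part of AGAT): cds_overlap FILE PARENT_A PARENT_B
# Prints "overlap" if any CDS of PARENT_A shares at least one base with
# any CDS of PARENT_B on the same sequence, otherwise "no overlap".
# Assumes one Parent= value per CDS line; strand is ignored.
cds_overlap() {
  awk -F '\t' -v a="$2" -v b="$3" '
    $3 == "CDS" {
      parent = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
      # Store sequence ID plus numeric start/end for each transcript.
      if (parent == a) { na++; a_seq[na] = $1; a_start[na] = $4 + 0; a_end[na] = $5 + 0 }
      if (parent == b) { nb++; b_seq[nb] = $1; b_start[nb] = $4 + 0; b_end[nb] = $5 + 0 }
    }
    END {
      for (i = 1; i <= na; i++)
        for (j = 1; j <= nb; j++)
          if (a_seq[i] == b_seq[j] && a_start[i] <= b_end[j] && b_start[j] <= a_end[i]) {
            print "overlap"
            exit
          }
      print "no overlap"
    }' "$1"
}
```

Usage would look like `cds_overlap genes.gff3 NBlab03G03860.1 NBlab03G03870.1`, where `genes.gff3` is a placeholder file name. Two transcripts whose exons or UTRs overlap but whose CDS intervals do not would report "no overlap" here.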


Thank you. By any chance, do you have a summary of the rules that determine when genes are considered or not?


I already answered this question here: A: intersecting two GFF3 missing data from the second file

22 months ago
Juke34 ★ 6.4k

Basically, transcription_start_site is not an officially accepted term in the GFF3 specification. The proper term to use here is TSS. You could replace the term with a sed command. transcription_end_site, on the other hand, is a correct term, but I forgot to add it to the tool (it will be fixed in the GitHub repo soon). You can easily fix that.

from SOFA:

[Term]
id: SO:0000315
name: TSS
namespace: sequence
def: "The first base where RNA polymerase begins to synthesize the RNA transcript." [SO:ke]
subset: SOFA
synonym: "transcription start site" EXACT []
synonym: "transcription_start_site" EXACT []
is_a: SO:0000835 ! primary_transcript_region
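
The rename suggested above can be sketched as follows. This uses awk rather than sed so that only the feature type in column 3 is changed and any occurrence of the string in the attributes column is left alone. The function name `rename_tss` and the file names in the usage note are placeholders, not part of AGAT.

```shell
# Hypothetical helper: rewrite the feature type (column 3) from
# transcription_start_site to TSS. Comment/directive lines starting
# with "#" and all other columns are passed through unchanged.
rename_tss() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       $3 == "transcription_start_site" { $3 = "TSS" }
       { print }' "$@"
}
```

For example, `rename_tss augustus.hints_utr.gff3 > augustus.hints_utr.tss.gff3` would write a renamed copy (the output file name is only an illustration) and leave the original file untouched.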


This information is basically discarded during file parsing.

To enable it:

agat_sp_gxf_to_gff3.pl --expose
nano features_level3.json # or use the text editor of your choice


and add these lines to the JSON file (/!\ if an entry is the last line, it must not end with a comma):

"transcription_start_site":"exon",
"transcription_end_site":"exon",
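
Because a stray trailing comma makes the file invalid JSON, it may be worth checking the edited file before rerunning the tool. A minimal sketch, assuming `python3` is available on the system; the helper name `json_ok` is hypothetical and not part of AGAT:

```shell
# Hypothetical helper: exit status 0 only if the file parses as JSON.
# Relies on Python's standard json.tool module, which rejects a
# trailing comma after the last entry.
json_ok() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}
```

Usage: `json_ok features_level3.json && echo "valid JSON" || echo "invalid JSON, check commas"`.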


Then you have to run agat_sp_complement_annotations.pl from the same directory, so that the edited file is the one that gets used.

Features with IDs such as nbis_NEW-three_prime_utr are either features that did not exist (were missing) before and were created during parsing, or features whose ID was not unique and was fixed by the parser. For more information, run:
agat_sp_gxf_to_gff3.pl --gff file.gff and inspect the output warnings.