Question: Convert Braker2 gff3 to EMBL flat file for ENA submission
6 weeks ago, m.eitel wrote:

Hi!

How can I convert the output gff3 of the Braker2 ab initio gene annotation pipeline to a valid EMBL flat file that I can submit to ENA?

I tried using EMBLmyGFF3 (https://github.com/NBISweden/EMBLmyGFF3). The tool seems to be working fine, but the BRAKER gff3 seems to be non-standard and I am always getting messages like:

13:59:52 WARNING feature: Partial CDS. The CDS with ID=g5848.t1.braker.CDS2 not a multiple of three.

This is the part of the braker2 gff3 it refers to:

scaffold_001 AUGUSTUS gene 2081205 2082079 1 + . ID=g5848.braker;
scaffold_001 AUGUSTUS mRNA 2081205 2082079 1 + . ID=g5848.t1.braker;Parent=g5848.braker
scaffold_001 AUGUSTUS start_codon 2081205 2081207 . + 0 Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081205 2081252 1 + 0 ID=g5848.t1.braker.CDS1;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081205 2081252 . + . ID=g5848.t1.braker.exon1;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081253 2081594 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081595 2081656 1 + 0 ID=g5848.t1.braker.CDS2;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081595 2081656 . + . ID=g5848.t1.braker.exon2;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081657 2081747 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081748 2081820 1 + 1 ID=g5848.t1.braker.CDS3;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081748 2081820 . + . ID=g5848.t1.braker.exon3;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081821 2081890 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081891 2082079 1 + 0 ID=g5848.t1.braker.CDS4;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081891 2082079 . + . ID=g5848.t1.braker.exon4;Parent=g5848.t1;
scaffold_001 AUGUSTUS stop_codon 2082077 2082079 . + 0 Parent=g5848.t1.braker;

I am basically getting this error for all genes...

Any suggestions are highly appreciated.

Michael

Tags: gene, assembly, genome
Modified 6 weeks ago by Juke-34 • written 6 weeks ago by m.eitel

You might also try GAG (https://github.com/genomeannotation/GAG) followed by tbl2asn (https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/) to try and submit to NCBI, but you will probably still get the same warnings

Reply written 6 weeks ago by jean.elbers

6 weeks ago, colindaven (Hannover Medical School) wrote:

This is a warning, not an error. The length of a CDS should be divisible by three because codons are 3 bp long and a CDS consists of codons.

I am not sure if CDS will always be annotated in 3bp codons, for example when lncRNAs are annotated.

Have you checked the sequences and looked at the annotation, e.g. in IGV? Do the CDS sequences look valid? Are the first and last codons always found and displayed correctly?

I hope it is not a 0-based vs 1-based coordinate error either, but it should not be.

Written 6 weeks ago by colindaven

It's not uncommon for Braker to output partial CDS, but indeed they do not make biological sense. If they are not pseudogenes, you will need to fix them before submitting, as they will not be accepted by the public repositories.

lncRNAs should also not have a CDS assigned to them: they are non-coding (hence the name) and thus will not produce a protein, so a CDS is pointless for them.

Reply modified 6 weeks ago • written 6 weeks ago by lieven.sterck

Some lncRNAs contain (micro)ORFs which can make perfect biological sense.

For example: https://www.sciencedirect.com/science/article/pii/S0968000416300317

Reply written 6 weeks ago by colindaven

True for the warning. However, I also got a bunch of ERROR messages:

13:59:49 ERROR feature: >>stop_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.

13:59:49 ERROR feature: >>start_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.

13:59:51 ERROR feature: >>inferred_parent<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.

Reply modified 6 weeks ago • written 6 weeks ago by m.eitel

Ah, then you'll have to check the EMBL feature types which are allowed. Also check a few existing EMBL files for examples.

Reply written 6 weeks ago by colindaven

Hi, I'm a developer of EMBLmyGFF3. You can ignore the messages about those features (stop_codon, start_codon...); they are not needed in the EMBL file. inferred_parent is created by the bcbio-gff Python GFF parser when a parent feature is missing. This is generally not a good sign. Do you have many of those inferred_parent messages?
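To see how many features point at a parent that does not exist, the file can be scanned without any GFF library. The following is only a sketch: the helper name `find_missing_parents` is made up for this illustration, the example lines are taken from the question with the columns assumed to be tab-separated, and it is not how bcbio-gff itself decides to create inferred_parent features.

```python
def find_missing_parents(gff_lines):
    """Return the set of Parent values that never occur as an ID."""
    ids, parents = set(), set()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                ids.add(value)
            elif key == "Parent":
                parents.update(value.split(","))
    return parents - ids


# A few lines from the question (tab-separated here by assumption).
example = [
    "scaffold_001\tAUGUSTUS\tgene\t2081205\t2082079\t1\t+\t.\tID=g5848.braker;",
    "scaffold_001\tAUGUSTUS\tmRNA\t2081205\t2082079\t1\t+\t.\tID=g5848.t1.braker;Parent=g5848.braker",
    "scaffold_001\tAUGUSTUS\tstart_codon\t2081205\t2081207\t.\t+\t0\tParent=g5848.t1.braker;",
    "scaffold_001\tAUGUSTUS\tCDS\t2081205\t2081252\t1\t+\t0\tID=g5848.t1.braker.CDS1;Parent=g5848.t1",
    "scaffold_001\tAUGUSTUS\texon\t2081205\t2081252\t.\t+\t.\tID=g5848.t1.braker.exon1;Parent=g5848.t1;",
]

print(find_missing_parents(example))
```

On a whole file, pass the open file handle instead of the list; a non-empty result means some features reference a parent ID that is never defined.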

Reply modified 6 weeks ago • written 6 weeks ago by Juke-34

When I load the GFF into visualization software (Geneious in my case), the CDS look normal.

I am just wondering whether this is a Braker/AUGUSTUS bug or a non-standard GFF3.

Reply modified 6 weeks ago • written 6 weeks ago by m.eitel

Can you check at the nucleotide level that the start position indicated in the GFF is really ATG and that the stop codon is one of the accepted stop codons? If it's a 0-based vs 1-based problem, you should be able to spot it easily.
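A minimal sketch of that nucleotide-level check, assuming the scaffold sequence is already loaded as a plain string. The function names and the toy sequences are invented for illustration, and the stop codon set is the standard genetic code, which is an assumption about your organism.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def codon_at(seq, start, end, strand):
    """Extract a feature given GFF3 coordinates (1-based, inclusive)."""
    codon = seq[start - 1:end].upper()
    return revcomp(codon) if strand == "-" else codon


# Toy scaffold: ATG at positions 3-5, TAA at positions 12-14.
toy = "GGATGAAACCCTAAGG"
print(codon_at(toy, 3, 5, "+"))                   # start_codon feature
print(codon_at(toy, 12, 14, "+") in STOP_CODONS)  # stop_codon feature

# Minus strand: the codon is read on the reverse complement.
print(codon_at("CCCATCC", 3, 5, "-"))
```

If the start_codon coordinates consistently give something other than ATG, but shifting them by one base does, that points to a coordinate-convention problem rather than a bad gene model.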

As @lieven.sterck said, it's not uncommon to get fragmented predicted genes, but if you have many of them it is really suspicious. Was your assembly very fragmented?

Reply modified 6 weeks ago • written 6 weeks ago by Juke-34

6 weeks ago, Juke-34 (Sweden) wrote:

In your example the total CDS length is definitely a multiple of three. The problem could come from something else, possibly a bug in the output format.
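The arithmetic behind that statement, using the four CDS coordinates from the question (GFF3 coordinates are 1-based and inclusive, so each length is end - start + 1):

```python
# CDS segments of g5848.t1 as (start, end), copied from the GFF3 excerpt.
cds_segments = [
    (2081205, 2081252),
    (2081595, 2081656),
    (2081748, 2081820),
    (2081891, 2082079),
]

lengths = [end - start + 1 for start, end in cds_segments]
total = sum(lengths)

print(lengths)    # [48, 62, 73, 189]
print(total)      # 372
print(total % 3)  # 0
```

The segments of 62 and 73 bp are not multiples of three on their own, only the joined CDS is. A tool that evaluates each segment separately would therefore complain about CDS2, but that is a guess about the tool's behavior and is not confirmed in this thread.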

I mean that the level-3 CDS and exon features refer to a parent feature g5848.t1, but this feature doesn't exist. The actual mRNA ID is g5848.t1.braker (the intron, start_codon and stop_codon lines already point to it).
So either add .braker to the Parent of those sub-features, or remove it from the mRNA ID and from every Parent attribute that references it.

That at least explains why you get inferred_parent features appearing from nowhere.

Try to fix that first, maybe it will solve the other problem too.
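One way to apply the first option (adding .braker to the dangling Parent values) is sketched below. The function name `fix_parents` is hypothetical and the input is assumed to be tab-separated GFF3. A Parent is only rewritten when it is not a known ID and the suffixed form is, so references that are already correct stay untouched.

```python
def fix_parents(gff_lines, suffix=".braker"):
    """Append `suffix` to Parent values that only match an ID once suffixed."""
    lines = [line.rstrip("\n") for line in gff_lines]

    # First pass: collect every ID defined in the file.
    ids = set()
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                ids.add(value)

    # Second pass: repair Parent attributes that point at a missing ID.
    fixed = []
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 9:
            fixed.append(line)
            continue
        new_fields = []
        for field in cols[8].split(";"):
            key, sep, value = field.strip().partition("=")
            if key == "Parent":
                value = ",".join(
                    p + suffix if p not in ids and p + suffix in ids else p
                    for p in value.split(",")
                )
            new_fields.append(key + sep + value)
        cols[8] = ";".join(new_fields)
        fixed.append("\t".join(cols))
    return fixed


example = [
    "scaffold_001\tAUGUSTUS\tmRNA\t2081205\t2082079\t1\t+\t.\tID=g5848.t1.braker;Parent=g5848.braker",
    "scaffold_001\tAUGUSTUS\tCDS\t2081205\t2081252\t1\t+\t0\tID=g5848.t1.braker.CDS1;Parent=g5848.t1",
    "scaffold_001\tAUGUSTUS\texon\t2081205\t2081252\t.\t+\t.\tID=g5848.t1.braker.exon1;Parent=g5848.t1;",
]
fixed = fix_parents(example)
for line in fixed:
    print(line)
```

Run it on a copy of the annotation and check the result before passing it to EMBLmyGFF3.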

Written 6 weeks ago by Juke-34