Question

Issue about generating EMBL Flat file for ENA submission

3

11 months ago

Shakunthala Natarajan ▴ 50

Hello all! I am trying to generate an EMBL flat file to submit an annotated assembly to ENA. I am using EMBLmyGFF3 to generate the flat file from the whole-genome FASTA file and the GFF3 file. I am getting two errors and a recurring warning:

Errors:

17:17:17 ERROR feature: >>start_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
                        Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
17:17:17 ERROR feature: >>stop_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
                    Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
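As the message itself says, these two errors can be ignored when the features are not needed. If a quieter run is preferred, one option is to drop the rows from the GFF3 before conversion. The following is a minimal sketch, not part of EMBLmyGFF3; the function name `drop_feature_types` is hypothetical, and it assumes that no other feature lists a start_codon or stop_codon as its Parent.

```python
def drop_feature_types(lines, unwanted=("start_codon", "stop_codon")):
    """Yield GFF3 lines, skipping feature rows whose type (column 3) is unwanted.

    Comment/directive lines, blank lines, and anything after a ##FASTA
    directive are passed through unchanged.
    """
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if in_fasta or line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] in unwanted:
            continue  # skip start_codon / stop_codon rows
        yield line


# Small in-memory demonstration with made-up rows.
sample = [
    "##gff-version 3\n",
    "NODE_1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1\n",
    "NODE_1\tsrc\tstart_codon\t1\t3\t.\t+\t0\tParent=g1\n",
    "NODE_1\tsrc\tstop_codon\t7\t9\t.\t+\t0\tParent=g1\n",
]
filtered = list(drop_feature_types(sample))
```

To apply it to real files, the same generator can be fed an open input file and its output written line by line to a new GFF3.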

Warnings:

17:17:43 WARNING EMBLmyGFF3: Sequence NODE_446_length_99_cov_479.909091 too short (99 bp)! Minimum accpeted by ENA is 100, we skip it !
17:17:43 WARNING EMBLmyGFF3: Sequence NODE_447_length_99_cov_30.409091 too short (99 bp)! Minimum accpeted by ENA is 100, we skip it !
17:17:43 WARNING EMBLmyGFF3: Sequence NODE_448_length_98_cov_103.285714 too short (98 bp)! Minimum accpeted by ENA is 100, we skip it !
17:17:43 WARNING EMBLmyGFF3: Sequence NODE_449_length_98_cov_59.095238 too short (98 bp)! Minimum accpeted by ENA is 100, we skip it !
17:17:43 WARNING EMBLmyGFF3: Sequence NODE_450_length_98_cov_49.000000 too short (98 bp)! Minimum accpeted by ENA is 100, we skip it !
17:17:43 WARNING EMBLmyGFF3: Sequence NODE_451_length_98_cov_39.142857 too short (98 bp)! Minimum accpeted by ENA is 100, we skip it !

Can someone please help me address these errors? Is there any way to handle the warnings and include the short sequences as well?

Thank you!

EMBL ENA EMBLmyGFF3 • 776 views

ADD COMMENT • link updated 8 months ago by polag01 ▴ 10 • written 11 months ago by Shakunthala Natarajan ▴ 50

score 2 · Accepted Answer · 2023-05-06

2

11 months ago

Juke34 8.5k

Is your plan to submit your file to ENA? If yes, then you do not need the start/stop codon features. For the short sequences, some people concatenate them into a single "unknown" chromosome, adding Ns to separate them. If you do this, you need to redo the annotation so that the features' locations stay synchronized with the new sequence.
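The bookkeeping behind the concatenation idea can be sketched as follows. This is an illustration under assumptions, not the answerer's procedure: the function names, the gap length of 100 Ns, and the sequence name `chrUn` are all hypothetical, and shifting existing coordinates is shown only as one possible way to keep features synchronized instead of re-running the annotation.

```python
def concatenate_short(records, min_len=100, gap=100):
    """Split (name, seq) records into long ones and one N-separated merged sequence.

    Returns (kept, merged_seq, offsets), where offsets maps each short
    sequence name to its 0-based start position inside merged_seq.
    """
    kept, parts, offsets = [], [], {}
    pos = 0
    for name, seq in records:
        if len(seq) >= min_len:
            kept.append((name, seq))
            continue
        if parts:  # separate consecutive short sequences with a run of Ns
            parts.append("N" * gap)
            pos += gap
        offsets[name] = pos
        parts.append(seq)
        pos += len(seq)
    return kept, "".join(parts), offsets


def shift_gff_line(line, offsets, new_seqid="chrUn"):
    """Move a GFF3 feature row from a short sequence onto the merged sequence."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9 or cols[0] not in offsets:
        return line  # not a feature row on a merged sequence: leave untouched
    off = offsets[cols[0]]
    cols[3] = str(int(cols[3]) + off)  # start (1-based in GFF3)
    cols[4] = str(int(cols[4]) + off)  # end
    cols[0] = new_seqid
    return "\t".join(cols) + "\n"


# Made-up example: one long contig and two contigs below the 100 bp limit.
records = [("big", "A" * 120), ("s1", "C" * 99), ("s2", "G" * 98)]
kept, merged, offsets = concatenate_short(records)
shifted = shift_gff_line("s2\tsrc\tgene\t1\t10\t.\t+\t.\tID=g\n", offsets)
```

Any `##sequence-region` directives in the GFF3 would also need updating, which this sketch does not attempt.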

ADD COMMENT • link 11 months ago by Juke34 8.5k

0

Thank you. This helps!

ADD REPLY • link 11 months ago by Shakunthala Natarajan ▴ 50

0

Juke34 After a lot of effort, I got the tool to work. Thank you. In my case, I am not submitting the sequences to ENA; I am building a repeats library to be concatenated with the RepeatMasker library. I need every repeat model to be represented, so a 100 bp minimum is not acceptable. Is there a workaround for this? I do not want to pad the sequences with Ns; it wouldn't make any sense. Thanks in advance for your feedback.

I would also like to point out that the documentation for the --accession switch is a little confusing as written. Describing it as a Boolean data type suggests that a value, {True | False}, must be supplied with the argument. I tried this many times without success until I used -a alone, without any argument. I think the documentation should state explicitly that no argument is to be supplied to this parameter. Just my thoughts.
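The behaviour described here is what a presence-only command-line flag looks like. The sketch below uses Python's standard `argparse` with `action="store_true"`; that EMBLmyGFF3 declares its option this way is an assumption made for illustration, and the `demo` parser is not the tool's real one.

```python
import argparse

# Hypothetical stand-in parser; only the flag style is of interest here.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "-a", "--accession",
    action="store_true",  # Boolean result, but the flag itself takes no value
    help="switch on accession handling (presence of the flag means True)",
)

with_flag = parser.parse_args(["-a"])
without_flag = parser.parse_args([])
print(with_flag.accession, without_flag.accession)  # True False

# Supplying a value is rejected: argparse reports an unrecognized argument
# and exits, which matches the failures described above.
try:
    parser.parse_args(["-a", "True"])
    rejected = False
except SystemExit:
    rejected = True
```

With this style of flag, the stored value is Boolean, yet the only choice on the command line is whether to include `-a` at all.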

ADD REPLY • link 8 months ago by polag01 ▴ 10