Question

EMBL-api-validator -- WARNING: "exon" usually expected to be at least "15" nt long.


5.3 years ago

User 4014 ▴ 40

Hi folks,

I am preparing a flat file of a fungal genome for submission to EMBL. I checked the flat file with EMBL-API_validator-1.1.263 and got this warning: WARNING: "exon" usually expected to be at least "15" nt long. Please check the accuracy. Could you please advise how to fix it? Or should I just ignore it, since it is only a warning?

Thank you very much in advance!

genome embl submission • 2.0k views

ADD COMMENT • link updated 5.3 years ago by Juke34 8.5k • written 5.3 years ago by User 4014 ▴ 40


Do you have predicted exons that are < 15 nt in your file? They don't make a lot of sense at that length. How did you do the annotation? How was the genome assembled?

ADD REPLY • link 5.3 years ago by GenoMax 141k


Hmm, it seems so. One of the three exons (the last one) is only 6 bases long. The predicted protein itself is 63 aa, though. Do you have a suggestion for how I should treat this exon and the corresponding protein? It is a draft genome, assembled with SPAdes 3.10 and annotated using FunGAP (https://github.com/CompSynBioLab-KoreaUniv/FunGAP). For this protein, though, FunGAP inherited the model from MAKER.

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


For this specific protein, what is it most similar to when you do a blastp search? You may want/need to adjust the annotations (sequences) based on the homology you see.

ADD REPLY • link 5.3 years ago by GenoMax 141k


I did blastp and the only match is a response regulator (1118 aa) from the bacterium Maricaulis salignorans. I should have run blastp as well, not only InterProScan. Perhaps it's better to remove this predicted protein?

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


Is the blastp match consistent over the majority of the protein sequence, excluding this bit at the end? Perhaps that part of the prediction is incorrect.
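One way to put a number on "consistent over the majority of the sequence" is to compute how much of the query is covered by the alignments. The sketch below assumes the blastp results were saved in the standard 12-column tabular format (-outfmt 6), where columns 7 and 8 are qstart and qend; the function names and the sample query ID are hypothetical.

```python
def parse_outfmt6(lines, query_id):
    """Collect (qstart, qend) pairs for one query from BLAST tabular lines.

    Assumes the default -outfmt 6 column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[0] == query_id:
            pairs.append((int(cols[6]), int(cols[7])))
    return pairs


def query_coverage(hits, query_length):
    """Fraction of the query covered by at least one alignment.

    hits: iterable of (qstart, qend), 1-based inclusive.
    Overlapping or adjacent intervals are merged so no residue is counted twice.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in hits)
    covered = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / query_length


# Hypothetical example: a 63 aa query with two overlapping alignments
example = [
    "prot1\tsubjA\t36.0\t20\t12\t0\t1\t20\t100\t119\t1e-3\t30.0",
    "prot1\tsubjA\t35.0\t26\t16\t0\t15\t40\t300\t325\t1e-2\t28.0",
]
coverage = query_coverage(parse_outfmt6(example, "prot1"), 63)
```

A low coverage value, together with a subject that is many times longer than the query, suggests the hit says little about whether the model as a whole is right.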

ADD REPLY • link 5.3 years ago by GenoMax 141k


The identity between the two proteins is only 36% and they differ greatly in size (63 aa vs 1118 aa). Actually, it is not only one model that has the problem. I ran blastp on a few more, and they seem correct, with nice alignments to other proteins, but they still contain a tiny tail of 2-3 aa at the end that triggers the warning. I guess it has something to do with the genome being only at the draft stage. As for the weird one, perhaps I can label it as a hypothetical protein and leave it in the annotation?

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


So you have not done any due diligence to check/correct these errors? Obviously the software is wrong in this case.

ADD REPLY • link 5.3 years ago by GenoMax 141k


The pipeline runs blastp/InterProScan/blastn/BUSCO before choosing the best models, so I thought that was sufficient. Plus, I don't really know how to check the models. Could you suggest how?

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


FunGAP performs gene prediction on given genome assembly and RNA-seq reads.

To confirm, is this the kind of data you are processing?

Did you pre-filter your SPAdes contigs to eliminate small/redundant pieces? This pipeline must be taking all sequences you provided to it when doing predictions. Did you look at the logs to see what the BUSCO results were suggesting (in terms of completeness of sequence)?

Correct annotation is hard. The pipeline did what it was programmed to do, but the results have to be vetted. Does the genome you are submitting have close relatives available? Were those considered in the annotation comparison?

ADD REPLY • link 5.3 years ago by GenoMax 141k


Yes, I used nucmer to remove repetitive contigs <500 bp before feeding the genome to FunGAP, but I did not look at the BUSCO results, since I thought FunGAP would sieve out those small and highly fragmented models.
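For reference, the size part of that pre-filter can be sketched in a few lines. This is only a length cutoff using the 500 bp threshold mentioned above; it does not detect repetitive or redundant contigs, which is what the nucmer step was for. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=500):
    """Keep only contigs with at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Hypothetical toy input: one 800 bp contig and one 4 bp contig
toy = [">contig_1", "ACGT" * 100, "ACGT" * 100, ">contig_2", "ACGT"]
kept = filter_by_length(read_fasta(toy), min_length=500)
```

For a real assembly the lines would come from an open file handle instead of the toy list.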

There is a genome of a close relative. My plan is to look at the BUSCO results, search for the models with short exons, and kick out those with low completeness if they are not present in the genome of the closely related species. :) Please suggest any additional steps I should consider.
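The "search for the models with short exons" step could look something like the sketch below. It assumes the annotation is available as GFF3 and uses the 15 nt threshold from the validator warning; whether the validator applies exactly this length rule to exon features, CDS features, or both is an assumption here, so both are reported. The function name and sample records are hypothetical.

```python
def short_features(gff_lines, min_length=15, feature_types=("exon", "CDS")):
    """Return (seqid, type, start, end, length, attributes) for short features."""
    found = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in feature_types:
            continue
        start, end = int(cols[3]), int(cols[4])
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length < min_length:
            found.append((cols[0], cols[2], start, end, length, cols[8]))
    return found


# Hypothetical records: a 90 nt exon and a 6 nt exon on the same transcript
sample = [
    "##gff-version 3",
    "contig_1\tFunGAP\texon\t1\t90\t.\t+\t.\tID=exon1;Parent=mRNA1",
    "contig_1\tFunGAP\texon\t100\t105\t.\t+\t.\tID=exon2;Parent=mRNA1",
]
hits = short_features(sample)
```

The Parent attribute in each reported line identifies the transcript to inspect in a genome browser or to check with blastp.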

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40

score 0 · Answer 1 · 2019-01-19


5.3 years ago

Juke34 8.5k

Hi, the best would be to manually curate it using a genome browser (WebApollo, Tablet or Artemis). If you have only a few of them, it's doable. Otherwise you can anyway submit the data with the WARNINGs it will work. I often see this kind of case when the prediction is ab initio. Indeed, you can have an exon of 100 bases in which only 10 bases are coding and the rest is part of the UTR. Most ab initio predictors will not predict the UTR part, so you end up with a short coding exon.

When I have plenty of short exons, I use a script to extend the ones at the extremities (you cannot easily touch the internal ones) to at least the minimum size accepted by ENA, by adding arbitrary UTRs. If you are interested, I can tell you where to find the script.
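To make the idea concrete, here is a minimal sketch of padding the first and last exon of a transcript outward until they reach a minimum length. This is not the actual script and only illustrates the principle; the function name, the 15 nt default, and the example coordinates are assumptions. CDS coordinates stay untouched, so the added bases become UTR.

```python
def extend_terminal_exons(exons, seq_length, min_length=15):
    """Pad the outermost exons of one transcript to at least min_length.

    exons: list of (start, end), 1-based inclusive, sorted by start.
    seq_length: length of the contig, so extensions never run off its end.
    Internal exons are left alone because moving their boundaries would
    change the splice sites.
    """
    if not exons:
        return []
    result = list(exons)

    # Leftmost exon: move its start toward the beginning of the contig.
    start, end = result[0]
    shortfall = min_length - (end - start + 1)
    if shortfall > 0:
        result[0] = (max(1, start - shortfall), end)

    # Rightmost exon: move its end toward the end of the contig.
    start, end = result[-1]
    shortfall = min_length - (end - start + 1)
    if shortfall > 0:
        result[-1] = (start, min(seq_length, end + shortfall))

    return result


# Hypothetical transcript whose last exon is only 6 nt long
fixed = extend_terminal_exons([(100, 200), (300, 400), (500, 505)], seq_length=1000)
```

The sketch does not check whether the extension runs into a neighbouring gene, which a real tool would need to handle.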

ADD COMMENT • link 5.3 years ago by Juke34 8.5k


Good to hear that you have also run into a similar problem. The script would be fantastic. Could you share it with me, please?

Thanks in advance!

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


You will have to git clone this repository https://github.com/NBISweden/GAAS and follow the installation procedure. Then use the script gff3_sp_fix_small_exon_from_extremities.pl.

ADD REPLY • link 5.3 years ago by Juke34 8.5k


Thanks again Juke! :)

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


"Otherwise you can anyway submit the data with the WARNINGs it will work."

Putting bad data into databases multiplies this problem since someone else down the road finds a hit to this erroneous entry and keeps propagating it.

I would rather see people submit no annotation if they have not done enough work on it. Have NCBI's automated pipelines do annotation since they probably have checks and balances built in. You can email NCBI to ask them to annotate a eukaryotic genome you want to eventually make public. Unfortunately EBI does not have the resources to do this on your side of the ocean.

ADD REPLY • link 5.3 years ago by GenoMax 141k


I will carefully examine the models with short exons and keep the bad ones out of the annotation. For those with proper/reasonable blastp hits, I will use the script to correct them. :)

ADD REPLY • link 5.3 years ago by User 4014 ▴ 40


Yes, I agree, but as I explained, it is not necessarily wrong. There is a good chance that the exon is short because it includes only the coding part. He has to investigate the annotation to better understand what is happening.

ADD REPLY • link 5.3 years ago by Juke34 8.5k