Question: EMBL-api-validator -- WARNING: "exon" usually expected to be at least "15" nt long.
User 401420 wrote:

Hi folks,

I am preparing a flat file of a fungal genome for submission to EMBL. I checked the flat file with EMBL-API_validator-1.1.263 and got the following warning: WARNING: "exon" usually expected to be at least "15" nt long. Please check the accuracy. Could you please advise how to fix it? Or should I just ignore it, since it is only a warning?

Thank you very much in advance!

submission embl genome • 231 views
modified 4 weeks ago by Juke-341.8k • written 4 weeks ago by User 401420

Do you have predicted exons that are < 15 nt in your file? They don't make a lot of sense at that length. How did you do the annotation? How was the genome assembled?

written 4 weeks ago by genomax62k

Hmm, it seems so. One of the three exons (the last one) is only 6 bases long, although the predicted protein itself is 63 aa. Do you have a suggestion for how I should treat this exon and the corresponding protein? It is a draft genome, assembled with SPAdes 3.10 and annotated using FunGAP (https://github.com/CompSynBioLab-KoreaUniv/FunGAP). For this protein, though, FunGAP inherited the model from MAKER.
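To list every feature that would trigger this warning, a scan like the following may help. It is only a minimal sketch, assuming the annotation is also available as a GFF3 file (tab-separated, 9 columns, 1-based inclusive coordinates); the function name, the example records and the IDs are made up for illustration and are not part of the validator or of FunGAP.

```python
from typing import Iterable, List, Tuple

MIN_EXON_LEN = 15  # threshold quoted in the validator warning


def short_features(gff_lines: Iterable[str],
                   feature_type: str = "exon",
                   min_len: int = MIN_EXON_LEN) -> List[Tuple[str, int, int, int, str]]:
    """Return (seqid, start, end, length, attributes) for features shorter than min_len."""
    hits = []
    for line in gff_lines:
        # Skip blank lines, comments and directives such as "##gff-version 3"
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Anything without 9 columns (e.g. an embedded FASTA section) is ignored
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        start, end = int(cols[3]), int(cols[4])
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length < min_len:
            hits.append((cols[0], start, end, length, cols[8]))
    return hits


if __name__ == "__main__":
    # Hypothetical three-exon model whose last exon is only 6 nt long
    example = [
        "##gff-version 3",
        "contig_1\tMAKER\texon\t100\t250\t.\t+\t.\tID=g1.e1;Parent=g1.t1",
        "contig_1\tMAKER\texon\t300\t340\t.\t+\t.\tID=g1.e2;Parent=g1.t1",
        "contig_1\tMAKER\texon\t400\t405\t.\t+\t.\tID=g1.e3;Parent=g1.t1",
    ]
    for hit in short_features(example):
        print(hit)
```

With a real file, passing an open file handle instead of the example list should work the same way, since the function only iterates over lines.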

modified 4 weeks ago • written 4 weeks ago by User 401420

For this specific protein, what is it most similar to when you do a blastp search? You may want/need to adjust the annotations (sequences) based on the homology you see.

written 4 weeks ago by genomax62k

I did blastp and the only match is a response regulator (1118 aa) from the bacterium Maricaulis salignorans. I should have run blastp as well, not only InterProScan. Perhaps it's better to remove this predicted protein?

written 4 weeks ago by User 401420

Is the blastp match consistent over the majority of the protein sequence, excluding this bit at the end? Perhaps that part of the prediction is incorrect.
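One way to put numbers on "consistent over the majority of the sequence" is to compute, for each hit, how much of the query is aligned and how many residues at the C-terminal end are left out. The sketch below assumes blastp was run with the default 12-column tabular output (-outfmt 6) and that the query length is known separately; the helper names and the example hit line are hypothetical, not real results.

```python
from typing import Dict

# Column order of the default tabular BLAST output (-outfmt 6)
OUTFMT6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                   "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hit(line: str) -> Dict[str, str]:
    """Split one 12-column tabular BLAST line into named fields."""
    return dict(zip(OUTFMT6_COLUMNS, line.rstrip("\n").split("\t")))


def query_coverage(hit: Dict[str, str], query_len: int) -> float:
    """Fraction of the query covered by this single alignment."""
    qstart, qend = int(hit["qstart"]), int(hit["qend"])
    return (abs(qend - qstart) + 1) / query_len


def uncovered_tail(hit: Dict[str, str], query_len: int) -> int:
    """Number of residues at the end of the query that the alignment does not reach."""
    return query_len - max(int(hit["qstart"]), int(hit["qend"]))


if __name__ == "__main__":
    # Made-up hit for a 63 aa query; the values are for illustration only
    line = "g1.t1\tsubject_1\t36.0\t58\t35\t1\t1\t58\t400\t456\t1e-05\t40.0"
    hit = parse_hit(line)
    print("coverage:", round(query_coverage(hit, 63), 2))
    print("unaligned residues at the end:", uncovered_tail(hit, 63))
```

A hit that covers most of the query but stops a few residues before the end would be compatible with the idea that only the short final exon is mispredicted.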

written 4 weeks ago by genomax62k

The identity between the two proteins is only 36% and they differ greatly in size (63 aa vs 1118 aa). Actually, it is not only this one model that has the problem. I ran blastp on a few more and they seem correct, with good alignments to other proteins, but they still contain a tiny tail of 2-3 aa at the end that triggers the warning. I guess it has something to do with the genome, as it is only at the draft stage. But for the weird one, perhaps I can label it as a hypothetical protein and leave it in the annotation?

modified 4 weeks ago • written 4 weeks ago by User 401420

So you have not done any diligence to check/correct for these errors? Obviously the software is wrong in this case.

written 4 weeks ago by genomax62k

The pipeline runs blastp/InterProScan/blastn/BUSCO before choosing the best models, so I thought that was sufficient. Plus, I don't really know how to check the models. Could you suggest how?

written 4 weeks ago by User 401420

FunGAP performs gene prediction on given genome assembly and RNA-seq reads.

To confirm, is this the kind of data you are processing?

Did you pre-filter your SPAdes contigs to eliminate small/redundant pieces? This pipeline must be taking all sequences you provided to it when doing predictions. Did you look at the logs to see what the BUSCO results were suggesting (in terms of completeness of sequence)?

Correct annotation is hard. The pipeline did what it was programmed to do, but the results have to be vetted. Does the genome you are submitting have close relatives available? Were those considered in annotation comparison?

written 4 weeks ago by genomax62k

Yes, I used nucmer to remove repetitive contigs <500 bp before feeding the genome to FunGAP. I did not look at the BUSCO results, though, since I thought FunGAP would sieve out those small and highly fragmented models.

There is a genome available from a close relative. I think the plan is: I will take a look at the BUSCO results, search for the models with short exons, and kick out those with low completeness if they are not present in the genome of the closely related species. :) Please suggest any additional steps I should consider.

written 4 weeks ago by User 401420
Juke-341.8k wrote:

Hi, the best approach would be to curate it manually using a genome browser (WebApollo, Tablet or Artemis). If you only have a few of them, it's doable. Otherwise, you can submit the data with the WARNINGs anyway; it will work. I often see this kind of case when the prediction is ab initio. Indeed, you can have an exon of 100 bases in which only 10 bases are coding and the rest is part of the UTR. Most ab initio predictors will not predict the UTR part, so you end up with a short coding exon.
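The 100-base exon example can be worked through with a small overlap calculation. This is just an illustrative sketch with made-up coordinates (1-based, inclusive, as in GFF3); the function name is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def coding_length(exon: Interval, cds: Interval) -> int:
    """Number of bases of the exon that are overlapped by the CDS segment."""
    start = max(exon[0], cds[0])
    end = min(exon[1], cds[1])
    return max(0, end - start + 1)


if __name__ == "__main__":
    exon = (1000, 1099)  # true exon: 100 bases, mostly UTR
    cds = (1000, 1009)   # only the first 10 bases are coding
    coding = coding_length(exon, cds)
    utr = (exon[1] - exon[0] + 1) - coding
    print("coding bases:", coding)
    print("UTR bases:", utr)
```

A predictor that does not model UTRs would report only the 10 coding bases as the exon, which is below the 15 nt threshold even though the real exon is 100 bases long.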

When I have plenty of short exons, I use a script to extend the ones at the extremities (you cannot easily touch the internal ones) beyond the minimum size accepted by ENA by adding arbitrary UTRs. If you are interested, I can tell you where to find the script.
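The general idea of padding terminal exons can be sketched as follows. This is not the actual script, only a simplified illustration under some assumptions: exons of one transcript are given in genomic order as 1-based inclusive intervals, only the exon coordinates are changed (the CDS stays as it is, so the added bases become UTR), and the function name and the optional seq_len parameter are hypothetical. It does not check whether the extension runs into a neighbouring gene.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def pad_terminal_exons(exons: List[Interval],
                       min_len: int = 15,
                       seq_len: Optional[int] = None) -> List[Interval]:
    """Extend the leftmost and rightmost exons outward until they reach min_len.

    Internal exons are left untouched, because moving their boundaries
    would change the splice sites.
    """
    if not exons:
        return []
    padded = [list(e) for e in exons]
    first, last = padded[0], padded[-1]

    # Leftmost exon grows toward smaller coordinates, but not below position 1
    missing = min_len - (first[1] - first[0] + 1)
    if missing > 0:
        first[0] = max(1, first[0] - missing)

    # Rightmost exon grows toward larger coordinates, limited by the contig length
    missing = min_len - (last[1] - last[0] + 1)
    if missing > 0:
        new_end = last[1] + missing
        if seq_len is not None:
            new_end = min(new_end, seq_len)
        last[1] = new_end

    return [(s, e) for s, e in padded]


if __name__ == "__main__":
    # Hypothetical three-exon model whose last exon is only 6 nt long
    print(pad_terminal_exons([(100, 250), (300, 340), (400, 405)]))
```

Working in genomic order keeps the sketch strand-agnostic: the leftmost exon always grows to the left and the rightmost to the right, whichever of them is the 5' or 3' end of the transcript.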

written 4 weeks ago by Juke-341.8k

Good to hear that you have also experienced a similar problem. The script would be fantastic. Could you share it with me, please?

Thanks in advance!

written 4 weeks ago by User 401420

You will have to git clone this repository https://github.com/NBISweden/GAAS and follow the installation procedure. Then use the script gff3_sp_fix_small_exon_from_extremities.pl.

modified 4 weeks ago • written 4 weeks ago by Juke-341.8k

Thanks again Juke! :)

written 4 weeks ago by User 401420

"Otherwise you can anyway submit the data with the WARNINGs it will work."

Putting bad data into databases multiplies this problem since someone else down the road finds a hit to this erroneous entry and keeps propagating it.

I would rather see people submit no annotation if they have not done enough work on it. Have NCBI's automated pipelines do annotation since they probably have checks and balances built in. You can email NCBI to ask them to annotate a eukaryotic genome you want to eventually make public. Unfortunately EBI does not have the resources to do this on your side of the ocean.

modified 4 weeks ago • written 4 weeks ago by genomax62k

I will carefully examine the models with short exons and keep the bad ones out of the annotation. For those with proper/reasonable blastp hits, I will use the script to correct them. :)

written 4 weeks ago by User 401420

Yes, I agree, but as I explained, it is not necessarily wrong. There is a good chance that the exon is short because it includes only the coding part. He has to investigate the annotation to better understand what is happening.

written 4 weeks ago by Juke-341.8k