Question: Predicting coding sequence region from GFF with exons + reference genome
Grey Monroe (Max Planck Institute) wrote, 3 months ago:

I have inherited a collection of genome annotation files (GFF3) for several newly assembled genomes, and I discovered that the coding region coordinates are incorrect. I'd like to remove the coding region coordinates and re-predict them from the exons.

The GFF files were created with the Comparative-Annotation-Toolkit (CAT / Augustus), using a combination of RNA-seq data and liftover from the reference genome for this species. The exon-intron structure appears to be correct in the new genomes. However, the start and stop coordinates of the coding regions (CDS) seem to have been forced onto the new genomes, even in cases where they produce amino acid sequences that don't make sense (i.e., the sequence does not begin with Met, has stop codons in the middle, or does not end with a stop codon).
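The three symptoms described above (no initial Met, internal stop codons, no terminal stop) can be checked per transcript once the CDS segments have been extracted and spliced. A minimal sketch in plain Python, assuming the spliced CDS is already available as a string in 5'->3' orientation and that the standard genetic code applies; the function names `translate` and `cds_problems` are made up for illustration and are not part of CAT, Augustus, or GAAS:

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def cds_problems(cds):
    """Return a list of human-readable problems found in a spliced CDS."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    protein = translate(cds)
    if not protein.startswith("M"):
        problems.append("does not start with Met")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if not protein.endswith("*"):
        problems.append("does not end with a stop codon")
    return problems


if __name__ == "__main__":
    print(cds_problems("ATGGCCTAA"))
    print(cds_problems("GCCTAAGCC"))
```

An empty list means the CDS looks like a complete ORF. Running a check like this over every transcript gives a quick count of how many models are affected before and after any fix.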

I would be open to other suggestions, but having spent some time working on it, I've decided to try to re-predict the CDS coordinates from the GFF file (remove the CDS features and re-predict the reading frame within the exons).

Can someone point me to a method that takes a GFF with exons plus the reference genome as input and calls the coding region coordinates?

Thank you!

Juke-34 (Sweden) wrote, 3 months ago:

I have a Perl script for that purpose in the GAAS repository: gff3_sp_fix_longest_ORF.pl


Thanks Juke. From the name of the script, I assume it looks at all potential ATG start codons and selects the one that gives the longest sequence before a stop codon?

Reply by Grey Monroe, 3 months ago
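The idea raised in this reply (scan for ATG codons and keep the one giving the longest stretch up to a stop codon) can be sketched as follows. This only illustrates the general longest-ORF idea and is not the logic of gff3_sp_fix_longest_ORF.pl. It assumes the exons have already been spliced into one string in 5'->3' orientation (reverse-complemented for minus-strand genes), only considers complete ORFs, and uses a made-up function name, `longest_orf`:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end) of the longest ATG..stop ORF, or None if there is none.

    Coordinates are 0-based, half-open, on the spliced transcript;
    the stop codon is included in the returned interval.
    """
    mrna = mrna.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG since the last stop in this frame
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    seq = "CCATGAAATGA"
    orf = longest_orf(seq)
    print(orf, seq[orf[0]:orf[1]])
```

Taking the first in-frame ATG after each stop is enough here, because any later ATG in the same frame would give a shorter ORF ending at the same stop codon.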

It extracts the current CDS to look at its size (it doesn't check for the presence of a start or stop codon), then it extracts the exons and does a prediction. It compares the length of the new prediction with the original and classifies each case into one of 5 categories (called models):

Model1 = the original sequence is part of the new prediction; the predicted one is longer.
Model2 = the original and predicted sequences are different; the predicted one is longer, and they don't overlap each other.
Model3 = the original and predicted proteins are different; the predicted one is longer, and they overlap each other.
Model4 = the prediction is shorter.
Model5 = the prediction is the same size but not in the correct frame (+1 or +2 bp gives a frame shift).

Depending on the models you activate (e.g. --model 1,4), if the prediction for a locus falls into one of those cases, it will replace the CDS.

P.S.: I just updated the repo because the link was broken (the script was called gff3_sp_fix_longestORF.pl, but I had renamed it to gff3_sp_fix_longest_ORF.pl), so do a git pull.

Reply by Juke-34, 3 months ago (modified 3 months ago)

"do a prediction"

What do you mean? Does it find a CDS that makes sense (i.e., one that starts with a start codon and ends with a stop codon)?

Reply by Grey Monroe, 3 months ago

Yes, it predicts a CDS with a start and a stop codon.

Reply by Juke-34, 3 months ago
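Once a start and stop have been located on the spliced exon sequence, the interval still has to be mapped back onto the genome to write new CDS features. A rough sketch of that last step, assuming exons are given as 1-based inclusive (start, end) pairs as in GFF3 and the ORF as a 0-based half-open interval on the spliced transcript; `orf_to_genomic_cds` is a hypothetical helper and not a function from GAAS:

```python
def orf_to_genomic_cds(exons, strand, orf_start, orf_end):
    """Map an ORF on the spliced transcript back to genomic CDS segments.

    exons     : list of (start, end), 1-based inclusive genomic coordinates
    strand    : "+" or "-"
    orf_start : 0-based start on the spliced transcript (5'->3')
    orf_end   : 0-based exclusive end on the spliced transcript
    Returns a list of (start, end) genomic segments sorted by position.
    """
    # Walk the exons in transcript order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    segments = []
    offset = 0  # transcript position of the first base of the current exon
    for start, end in ordered:
        length = end - start + 1
        lo = max(orf_start, offset)
        hi = min(orf_end, offset + length)
        if lo < hi:  # this exon contains part of the ORF
            if strand == "+":
                segments.append((start + (lo - offset), start + (hi - offset) - 1))
            else:
                segments.append((end - (hi - offset) + 1, end - (lo - offset)))
        offset += length
    return sorted(segments)


if __name__ == "__main__":
    exons = [(101, 110), (201, 210)]
    print(orf_to_genomic_cds(exons, "+", 5, 15))
    print(orf_to_genomic_cds(exons, "-", 0, 12))
```

This sketch does not compute the phase column that GFF3 requires for CDS features; that would have to be derived from the cumulative length of the preceding CDS segments in transcript order.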