Question: Predicting coding sequence region from GFF with exons + reference genome
Asked 12 months ago by Grey Monroe (Max Planck Institute):

I have inherited a collection of genome annotation files (GFF3) for several newly assembled genomes, and I discovered that the coding region coordinates in them are incorrect. I'd like to remove the coding region coordinates and re-predict them from the exons.

The GFF files were created using Comparative-Annotation-Toolkit (CAT / Augustus), using a combination of RNA-seq data and liftover from the reference genome for this species. The exon-intron structure appears to be correct in the new genomes. However, the problem seems to be that the start and stop coordinates for the coding regions (CDS) have been forced onto the new genomes even in cases where they produce amino acid sequences that don't make any sense (i.e. they do not begin with Met, have stop codons in the middle of the sequence, or do not end with a stop codon).
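The three symptoms listed above can be checked automatically once the CDS segments of a transcript have been spliced into one nucleotide string. The following is a minimal sketch in plain Python, assuming the standard genetic code; the function names `translate` and `cds_problems` are made up for illustration and are not part of CAT or any other tool mentioned here.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def cds_problems(cds):
    """Return the reasons why a spliced CDS looks wrong (empty list if none)."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    protein = translate(cds)
    if not protein.startswith("M"):
        problems.append("does not begin with Met")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if not protein.endswith("*"):
        problems.append("does not end with a stop codon")
    return problems
```

Running `cds_problems` over every transcript gives a quick count of how many CDS annotations are affected before deciding whether to re-predict all of them or only the broken ones.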

I would be open to other suggestions, but having spent some time working on it, I've decided to try to re-predict the CDS coordinates from the GFF file (remove the CDS features and re-predict the reading frame in the exons).

Can someone point me to a method that takes a GFF with exons plus a reference genome as input and calls the coding region coordinates?
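Whatever tool ends up doing the prediction, the first step with these inputs is to rebuild each transcript sequence from its exons. Here is a minimal sketch of that step, assuming the chromosome sequence is already loaded as a string and the exons are given as GFF-style 1-based inclusive (start, end) pairs; `spliced_transcript` is a hypothetical helper name.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (handles A, C, G, T, N in either case)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def spliced_transcript(chrom_seq, exons, strand):
    """Join exon sequences in transcript order.

    exons: list of (start, end) pairs, 1-based and inclusive as in GFF3.
    strand: "+" or "-" as in the GFF3 strand column.
    """
    # Concatenate exons in genomic order, then flip for minus-strand genes.
    seq = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(seq) if strand == "-" else seq
```

For example, with `chrom_seq = "CCATGAAGGGTTTTAACC"` and exons `[(3, 7), (12, 16)]`, the plus-strand transcript is the two exon sequences joined with the intron removed.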

Thank you!

Answered 12 months ago by Juke34 (Sweden):

I have a Perl script for that purpose in the AGAT toolkit (conda install -c bioconda agat):
agat_sp_fix_longest_ORF.pl


Thanks Juke. From the name of the script, I assume it just looks at all potential ATG start codons and selects the one that gives the longest sequence before a stop codon?

Reply written 12 months ago by Grey Monroe

It extracts the current CDS to measure its size (it doesn't check for the presence of start or stop codons), then it extracts the exons and does a prediction. It compares the length of the new prediction with the original and classifies each locus into one of 5 cases (called models):

Model1 = the original sequence is part of the new prediction; the predicted one is longer.
Model2 = the original and predicted sequences are different; the predicted one is longer, and they don't overlap.
Model3 = the original and predicted proteins are different; the predicted one is longer, and they overlap.
Model4 = the prediction is shorter.
Model5 = the prediction is the same size but not in the correct frame (a +1 or +2 bp shift gives a frameshift).

Depending on the models you activate (e.g. --model 1,4), if the prediction at a locus falls into one of those cases, it replaces the CDS.
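To make the five cases concrete, here is one possible reading of them in Python. This is only a sketch of the description above, not AGAT's actual code: the coordinate convention (0-based, half-open intervals on the spliced transcript), the tie-breaking rules, and the names `classify_prediction` and `should_replace` are all assumptions made for illustration.

```python
def classify_prediction(original, predicted):
    """Classify a re-predicted CDS against the original one.

    Both arguments are (start, end) intervals on the spliced transcript,
    0-based and half-open. Returns a model number 1-5, or None when none
    of the five cases applies (e.g. the two CDS are identical).
    """
    o_start, o_end = original
    p_start, p_end = predicted
    o_len, p_len = o_end - o_start, p_end - p_start
    overlap = min(o_end, p_end) - max(o_start, p_start) > 0

    if p_len < o_len:
        return 4  # prediction is shorter
    if p_len == o_len:
        shifted = (p_start - o_start) % 3 != 0
        return 5 if shifted else None  # same size, shifted by +1 or +2 bp
    # From here on the prediction is longer than the original.
    if not overlap:
        return 2
    contains = p_start <= o_start and o_end <= p_end
    same_frame = (o_start - p_start) % 3 == 0
    return 1 if contains and same_frame else 3


def should_replace(original, predicted, active_models):
    """True if the prediction falls into one of the activated models."""
    return classify_prediction(original, predicted) in active_models
```

With the models from the example above activated, `should_replace((0, 60), (0, 30), {1, 4})` is true because the shorter prediction falls under model 4, while a longer, non-overlapping prediction (model 2) would leave the original CDS in place.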

P.S.: I just updated the repo because the link was broken (the script was called gff3_sp_fix_longestORF.pl but I had renamed it to gff3_sp_fix_longest_ORF.pl), so do a git pull.

Reply written 12 months ago by Juke34

Quoting Juke34: "does a prediction"

What do you mean? Does it find a CDS that makes sense (i.e. one that starts with a start codon and ends with a stop codon)?

Reply written 12 months ago by Grey Monroe

Yes, it predicts a CDS with a start and a stop codon.

Reply written 12 months ago by Juke34
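For readers who want to see the general idea discussed in this thread, the following sketch finds the longest ATG-to-stop ORF in a spliced transcript and maps it back to genomic CDS segments. It is an illustration under simple assumptions (standard stop codons, forward strand of the transcript only, complete ORFs only) and not the AGAT implementation; `longest_orf` and `orf_to_genomic` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based, half-open, on the transcript; the stop codon
    is included in the interval.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None  # first ATG seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


def orf_to_genomic(orf, exons, strand):
    """Map a transcript interval to genomic CDS segments (1-based, inclusive).

    exons: list of (start, end) pairs, 1-based inclusive as in GFF3.
    """
    orf_start, orf_end = orf
    segments = []
    offset = 0  # transcript position of the current exon's first base
    for start, end in sorted(exons, reverse=(strand == "-")):
        length = end - start + 1
        lo = max(orf_start, offset)
        hi = min(orf_end, offset + length)
        if lo < hi:  # the ORF overlaps this exon
            if strand == "-":
                segments.append((end - (hi - offset) + 1, end - (lo - offset)))
            else:
                segments.append((start + (lo - offset), start + (hi - offset) - 1))
        offset += length
    return sorted(segments)
```

The segments returned by `orf_to_genomic` correspond to the CDS features one would write back into the GFF for that transcript; phase values would still need to be computed separately.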