I have inherited a collection of genome annotation files (GFF3) for several newly assembled genomes, and I discovered that the coding region coordinates in them are incorrect. I'd like to remove the coding region coordinates and re-predict them from the exons.
The GFF files were created using Comparative-Annotation-Toolkit (CAT / Augustus), using a combination of RNA-seq data and liftover from the reference genome for this species. The exon-intron structure appears to be correct in the new genomes. However, the problem seems to be that the start and stop coordinates for the coding regions (CDS) have been forced onto the new genomes even in cases where they produce amino acid sequences that don't make any sense (i.e., the sequence does not begin with Met, has stop codons in the middle, or does not end with a stop codon).
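A minimal sketch of the kind of check that exposes these three problems, assuming the spliced CDS nucleotide sequence has already been extracted in transcript orientation and that the stop codon is included in the CDS (as GFF3 expects). The function name `cds_problems` is hypothetical and only for illustration; it uses the standard genetic code.

```python
# Stop codons of the standard genetic code.
STOPS = {"TAA", "TAG", "TGA"}

def cds_problems(cds):
    """Return a list of reasons why a spliced CDS looks broken (empty if none).

    Assumes `cds` is in transcript orientation and includes the stop codon.
    """
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not start with ATG")
    if any(codon in STOPS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOPS:
        problems.append("does not end with stop codon")
    return problems

print(cds_problems("ATGAAATAA"))  # a well-formed toy CDS
print(cds_problems("AAATAGAAA"))  # a toy CDS showing all three problems
```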
I would be open to other suggestions, but having spent some time working on it, I've decided to try to re-predict the CDS coordinates from the GFF file (that is, remove the CDS features and re-predict the reading frame within the exons).
Can someone point me to a method that takes a GFF with exons plus a reference genome as input and calls the coding region coordinates?
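To make the request concrete, here is a rough sketch of the simplest version of what such a method would do: splice the exons, take the longest ATG-to-stop open reading frame, and map it back to genomic CDS segments. This is a naive illustration under stated assumptions, not a substitute for a real gene predictor: it assumes complete transcripts, the standard stop codons, 1-based inclusive GFF3 coordinates, and a single contig held as a Python string. All function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse-complement a nucleotide string (other characters pass through)."""
    return seq.translate(COMPLEMENT)[::-1]

def spliced_sequence(contig, exons, strand):
    """Join exon sequences (1-based inclusive coords) in transcript orientation."""
    seq = "".join(contig[start - 1:end] for start, end in sorted(exons))
    return revcomp(seq) if strand == "-" else seq

def longest_orf(mrna):
    """Return (start, end) 0-based half-open offsets of the longest ATG..stop ORF.

    The stop codon is included. Returns None if no complete ORF exists.
    """
    mrna = mrna.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOPS:
                if best is None or (i + 3 - orf_start) > (best[1] - best[0]):
                    best = (orf_start, i + 3)
                orf_start = None
    return best

def orf_to_genomic(exons, strand, orf):
    """Map transcript offsets of an ORF to genomic CDS segments (1-based inclusive)."""
    orf_start, orf_end = orf
    segments = []
    offset = 0  # transcript offset at which the current exon begins
    # Walk exons in transcript order: ascending on "+", descending on "-".
    for start, end in sorted(exons, reverse=(strand == "-")):
        length = end - start + 1
        lo = max(orf_start, offset)
        hi = min(orf_end, offset + length)
        if lo < hi:  # the ORF overlaps this exon
            if strand == "-":
                segments.append((end - (hi - offset) + 1, end - (lo - offset)))
            else:
                segments.append((start + (lo - offset), start + (hi - offset) - 1))
        offset += length
    return sorted(segments)

# Toy example: two exons separated by a 6 bp intron on the plus strand.
contig = "CCATGAAAGTAAGTCCCTAAGG"
exons = [(1, 8), (15, 22)]
mrna = spliced_sequence(contig, exons, "+")
orf = longest_orf(mrna)
print(orf_to_genomic(exons, "+", orf))  # [(3, 8), (15, 20)]
```

The longest-ORF rule will pick the wrong frame for some transcripts (for example, those with long UTRs or incomplete 5' ends), which is why I am looking for an established tool rather than rolling my own.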