Hi,
I am attempting to carry out genome-wide dN/dS analyses, and have predicted coding sequences and proteins from two species using Augustus. My problem is that the Augustus coding and protein sequences do not completely match up, because some of the coding sequences from incomplete genes carry extra nucleotides at their 5' or 3' ends. For instance, a gene that is incomplete at the 5' end may have one or two extra nucleotides before the start of the ORF, so the protein is not encoded in the first reading frame. The same problem occurs when the 3' end is incomplete: there are often one or two extra nucleotides after the last complete codon. The dN/dS programs that I have been trying to use do not like this!
Does anyone have any ideas how I can trim the Augustus coding sequence predictions so that they're all in the first reading frame? I noticed that the Augustus getAnnoFasta.pl script has a flag called '--chop_cds' that seems like it should work, but I've found that it doesn't do what I want.
Thanks for your help! Ryan
With agat_sp_extract_sequences.pl from AGAT you can extract the CDS and choose whether or not to clip the first base(s) so that the sequence starts in frame. That may be what you are looking for.
Thanks Juke34. AGAT seems ideal from the perspective of trimming the offset 5' nucleotides, but does it have the capacity to trim off 3' bases that aren't in frame? I realize that I could probably script this, but it would be great to have a tool that does this too!
No, it does not. But once the 5' end is trimmed, you can probably find a simple command to clean up the 3' side: taking the sequence length modulo 3 tells you how many nucleotides you must remove from the end of the sequence.
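For what it's worth, the modulo 3 idea could be scripted roughly like this. It is only a minimal sketch, assuming the sequences have already been clipped at the 5' end so that every record starts in frame; the function names (`trim_to_frame`, `read_fasta`, `trim_fasta`) are made up for illustration and are not part of AGAT or Augustus.

```python
from typing import Iterable, Iterator, Tuple


def trim_to_frame(seq: str) -> str:
    """Drop the 1-2 trailing bases that do not complete a codon.

    Assumes the sequence already starts in frame at its 5' end.
    """
    extra = len(seq) % 3  # 0, 1 or 2 leftover bases at the 3' end
    return seq[:len(seq) - extra] if extra else seq


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs with each sequence trimmed to whole codons."""
    for header, seq in read_fasta(lines):
        yield header, trim_to_frame(seq)


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python trim3.py < cds.fa > cds.trimmed.fa
    for header, seq in trim_fasta(sys.stdin):
        print(f">{header}")
        print(seq)
```

Note that this only fixes the length. It does not check that the resulting frame is the right one, so it is worth translating a few trimmed sequences and comparing them with the Augustus protein predictions.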