Does anyone know how to extract CDS from transcripts? I have transcripts structured like this:
Xspecies1234-mRNA transcript offset:272 AED:0.06 eAED:0.06 QI:272|1|1|1|0.8|0.83|6|66|318
Notice the "transcript offset" part. Removing the 272 bases before the start codon at that offset is not enough to isolate the coding sequence, because the stop codon lies somewhere before the end of the transcript. I can fix the transcripts manually, but going through thousands of them is extremely time-consuming. Any ideas on how to automate this for a multi-FASTA? I'm looking for a way to cut off not only the part before the offset but also everything after the first in-frame stop codon.
Please give us a full example (FASTA...).
I'd use TransDecoder for this if possible, or maybe the newer alternative, TD2: https://github.com/Markusjsommer/TD2/wiki
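If you would rather trust the offsets already in your headers than re-predict ORFs, a small script may be enough. Below is a minimal sketch in plain Python, not a tested pipeline. It assumes that the number after "offset:" is the count of bases preceding the start codon (so the CDS starts at that 0-based index), that the standard stop codons TAA, TAG and TGA apply, and that you want to keep the stop codon in the output. The function names read_fasta and extract_cds are made up for this example.

```python
import re

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: the header carries the 5' UTR length as "offset:<int>".
OFFSET_RE = re.compile(r"offset:(\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_cds(header, seq, keep_stop=True):
    """Return the sequence from the header offset to the first in-frame stop.

    Returns None if the header has no offset field. If no in-frame stop
    codon is found, returns everything up to the last complete codon.
    """
    match = OFFSET_RE.search(header)
    if match is None:
        return None
    start = int(match.group(1))
    seq = seq.upper()
    # Walk codon by codon in the frame defined by the offset.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            end = i + 3 if keep_stop else i
            return seq[start:end]
    # No in-frame stop: trim any trailing partial codon.
    end = start + (len(seq) - start) // 3 * 3
    return seq[start:end]
```

To run it on a multi-FASTA, loop over read_fasta(open(your_file)) and write out each header with the result of extract_cds. It is worth checking that every returned sequence starts with ATG; if many do not, the offset is probably not what I assumed above.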