TransDecoder can be used to predict coding sequences (CDS) (and eventually also translate them into amino acid sequences) from nucleotide fasta files.
TransDecoder predicts more than one CDS per sequence, as it translates in all 6 frames and also predicts incomplete CDS.
I have a de novo assembled transcriptome from which I have predicted a set of CDS, and I am interested in assessing the read support for these sequences.
My question is: would it be a mistake to try and map the entire set of reads to the CDS file to assess read support here? Some incomplete, unformulable intuition tells me that using the entire read set could lead to spurious read support as reads that were not used in constructing the parent transcript can end up mapping to a derived CDS by chance. I also "feel" (not think) that the multimapping rate would be quite high because of the many-to-one relationship between the transcripts and the (therefrom derived) CDS.
If I should not use the entire read set, then this leads to my second question: how do I subset the set of reads that were actually used to construct the transcripts? Can I first map all reads against the assembly, select only those reads that did map against the assembly, and then try and map these against the CDS file(s)?
The ultimate objective of this is to try and somehow "rank" the various CDS for each transcript based on read support. (Comments on this are also welcome.)
Your inputs would be much appreciated.