TransDecoder
can be used to predict coding sequences (CDS) (and eventually also translate them into amino acid sequences) from nucleotide fasta files. TransDecoder
predicts more than one CDS per sequence, as it translates in all 6 frames and also predicts incomplete CDS.
I have a de novo assembled transcriptome from which I have predicted a set of CDS, and I am interested in assessing the read support for these sequences.
My question is: would it be a mistake to try and map the entire set of reads to the CDS file to assess read support here? Some incomplete, unformulable intuition tells me that using the entire read set could lead to spurious read support as reads that were not used in constructing the parent transcript can end up mapping to a derived CDS by chance. I also "feel" (not think) that the multimapping rate would be quite high because of the many-to-one relationship between the transcripts and the (therefrom derived) CDS.
If I should not use the entire read set, then this leads to my second question: how do I subset the set of reads that were actually used to construct the transcripts? Can I first map all reads against the assembly, select only those reads that did map against the assembly, and then try and map these against the CDS file(s)?
The ultimate objective of this is to try and somehow "rank" the various CDS for each transcript based on read support. (Comments on this are also welcome.)
Your inputs would be much appreciated.
Edit: I suppose I must mention that this post sort of follows up on these two posts, respectively.
HI ponganta I wasn't really able to reach a conclusion I suppose. The main reason I was (and am) interested in this is because I wanted to find a way to choose the "best" protein isoform (as predicted by
TransDecoder
) based on read support. I ended up ultimately sort of dodging this question. I am none too happy to have done this, however.Did you check how often it is that you got the
CDS-UTR-CDS
situation? Actually, how did you end up finding these contigs anyway? I had been operating under the assumption thatTransDecoder
was mostly predicting nested ORFs. I never thought about the consecutive scenario.I am not very surprised by the former (statement) but I am by the latter. Did you retain all protein isoforms for the differential analysis?
Thank you for contributing to the topic!!
Hi Dunois,
No, I only used successfully annotated isoforms. Also, I aggregated to gene-level. This may very well be the reason for the lack of a difference. However, this may demonstrate the reproducibility of gene-level transcript quantification, even in the presence of chimeric transcripts.
When using transdecoder via dammit!, the precise location of the extracted CDS is included in the headers of the longest_orfs.pep (start & endpoint), as well as its UTRs. I took a look at single transcripts with more than one CDS, and found valid, concatenated CDSs, especially with plastidary genes. I would be glad, if you could share you knowledge once a solution is available!
I think this explains why the read support dropped. A good subset of sequences usually end up not getting annotated in these de novo scenarios, I believe.
I see!! So it's from the
TransDecoder
output. I never thought to actually ever analyze that, thank you for this tip!! And I am surprised you gotDammit!
to work at all. I just gave up on that tool.I will!! And the same goes for you!!
Regarding dammit!: Use the conda-version and manually replace links to the BUSCO-databases. Then it should work like a charm. The author (Camille Scott) has created a true marvel here, sad that its not used that much.
Were the
BUSCO
links the only issue you encountered during installation? I just gave up and went withEnTAP
instead. It was quite the hassle to install as well, but it ran well in the end.Yes, they were the only problem encountered. However, I did not use the entire RefSeq-database for annotation, but only one custom database (one closely related proteome),