Assessing read support for coding sequences predicted by TransDecoder
1
1
Entering edit mode
20 months ago
Dunois ★ 2.0k

TransDecoder can be used to predict coding sequences (CDS) (and eventually also translate them into amino acid sequences) from nucleotide fasta files. TransDecoder predicts more than one CDS per sequence, as it translates in all 6 frames and also predicts incomplete CDS.

I have a de novo assembled transcriptome from which I have predicted a set of CDS, and I am interested in assessing the read support for these sequences.

My question is: would it be a mistake to try and map the entire set of reads to the CDS file to assess read support here? Some incomplete, unformulable intuition tells me that using the entire read set could lead to spurious read support as reads that were not used in constructing the parent transcript can end up mapping to a derived CDS by chance. I also "feel" (not think) that the multimapping rate would be quite high because of the many-to-one relationship between the transcripts and the (therefrom derived) CDS.

If I should not use the entire read set, then this leads to my second question: how do I subset the set of reads that were actually used to construct the transcripts? Can I first map all reads against the assembly, select only those reads that did map against the assembly, and then try and map these against the CDS file(s)?

The ultimate objective of this is to try and somehow "rank" the various CDS for each transcript based on read support. (Comments on this are also welcome.)

Your inputs would be much appreciated.

Edit: I suppose I must mention that this post sort of follows up on these two posts, respectively.

cds prediction transdecoder de novo assembly • 1.2k views
1
Entering edit mode
16 months ago
ponganta ▴ 530

Hi,

did you come to a conclusion yet? I'm currently wondering the same thing. Conducted de novo transcriptome assembly, but noticed that some contigs are basically concatenated genes (valid CDS, with BLAST hits, separated by 5' and 3' UTR). Transdecoder was capable of disentangling those CDS. Wouldn't it be more precise to quantify with CDS instead of assembled transcripts?

When using Salmon with the CDS instead of assembled transcript as reference, read support dropped dramatically (from 99 % to 70 %). Subsequent analyses using DESeq2 did not show any difference in data quality (dispersion plots for assembled transcripts & CDS were almost identical).

My conclusion is, that chimeric transcripts are chimeric for a reason: they seem to be assembled together due to overlapping read support. I could be entirely wrong though, and I'd be glad if someone has a qualified answer! It's an interesting topic.

Hope this helps, sorry if it doesn't!

Lukas

1
Entering edit mode

HI ponganta I wasn't really able to reach a conclusion I suppose. The main reason I was (and am) interested in this is because I wanted to find a way to choose the "best" protein isoform (as predicted by TransDecoder) based on read support. I ended up ultimately sort of dodging this question. I am none too happy to have done this, however.

Did you check how often it is that you got the CDS-UTR-CDS situation? Actually, how did you end up finding these contigs anyway? I had been operating under the assumption that TransDecoder was mostly predicting nested ORFs. I never thought about the consecutive scenario.

When using Salmon with the CDS instead of assembled transcript as reference, read support dropped dramatically (from 99 % to 70 %). Subsequent analyses using DESeq2 did not show any difference in data quality (dispersion plots for assembled transcripts & CDS were almost identical).

I am not very surprised by the former (statement) but I am by the latter. Did you retain all protein isoforms for the differential analysis?

Thank you for contributing to the topic!!

1
Entering edit mode

Hi Dunois,

Did you retain all protein isoforms for the differential analysis?

No, I only used successfully annotated isoforms. Also, I aggregated to gene-level. This may very well be the reason for the lack of a difference. However, this may demonstrate the reproducibility of gene-level transcript quantification, even in the presence of chimeric transcripts.

how did you end up finding these contigs anyway

When using transdecoder via dammit!, the precise location of the extracted CDS is included in the headers of the longest_orfs.pep (start & endpoint), as well as its UTRs. I took a look at single transcripts with more than one CDS, and found valid, concatenated CDSs, especially with plastidary genes. I would be glad, if you could share you knowledge once a solution is available!

0
Entering edit mode

No, I only used successfully annotated isoforms.

I think this explains why the read support dropped. A good subset of sequences usually end up not getting annotated in these de novo scenarios, I believe.

When using transdecoder via dammit!,

I see!! So it's from the TransDecoder output. I never thought to actually ever analyze that, thank you for this tip!! And I am surprised you got Dammit! to work at all. I just gave up on that tool.

I would be glad, if you could share you knowledge once a solution is available!

I will!! And the same goes for you!!

1
Entering edit mode

Regarding dammit!: Use the conda-version and manually replace links to the BUSCO-databases. Then it should work like a charm. The author (Camille Scott) has created a true marvel here, sad that its not used that much.

0
Entering edit mode

Were the BUSCO links the only issue you encountered during installation? I just gave up and went with EnTAP instead. It was quite the hassle to install as well, but it ran well in the end.

1
Entering edit mode

Yes, they were the only problem encountered. However, I did not use the entire RefSeq-database for annotation, but only one custom database (one closely related proteome),