Question: Is TransDecoder predicting the "true" set of protein-coding regions?
Asked 24 months ago by Santiago Montero-Mendieta (Sweden)

Dear everyone,

I have downloaded contig sequences for the same species from TSA (Transcriptome Shotgun Assembly), from two different BioProjects (same authors, different studies). One file contains ~800,000 sequences and the other ~400,000.

I'm interested in identifying protein-coding regions and I'm using TransDecoder for that purpose. After running TransDecoder I got ~300,000 and ~150,000 protein-coding regions, respectively. I'm aware that TransDecoder looks for possible ORFs in all six reading frames, so the number of input contigs is probably correlated with the final number of predicted proteins.
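To illustrate why the ORF count tends to scale with the number of contigs, here is a minimal sketch of a six-frame scan in plain Python. It is not TransDecoder's algorithm: it only reports complete ATG-to-stop ORFs, whereas TransDecoder also reports partial ORFs and scores coding potential. The function names are made up for this example; the 100-codon default mirrors TransDecoder's default minimum ORF length of 100 amino acids.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_six_frames(seq, min_codons=100):
    """Return (strand, frame, start, end) for every ATG..stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the scanned
    strand. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens an ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        found.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return found
```

Every contig is scanned independently in each frame, so doubling the number of contigs roughly doubles the number of candidate ORFs before any filtering.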

However, I'm wondering how one can infer the "true" (i.e. closest to reality) set of protein-coding regions for a species. For example, the proteome of Xenopus tropicalis currently contains 39,662 sequences (or mRNAs as stated here) and that of Anolis carolinensis 32,230. So why do I get so many proteins, and how can I get a more realistic number?

Thanks!

Tags: orf, rna-seq, transdecoder

I recommend reading the manual: it describes how to use BLASTP and Pfam searches to help select coding regions.
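As a rough sketch of that optional step, the helper below only builds the four command lines (ORF extraction, BLASTP, hmmscan against Pfam, final prediction) and does not run anything. The file names, database paths and the `transdecoder_commands` helper are assumptions for illustration; the output directory name follows the TransDecoder v5 layout and may differ in other versions, so check the manual for the version you have installed.

```python
import os
import shlex


def transdecoder_commands(transcripts, blast_db, pfam_hmm, threads=4):
    """Build (but do not execute) the homology-aware TransDecoder steps."""
    # Assumed v5 layout: <transcripts basename>.transdecoder_dir/longest_orfs.pep
    orf_dir = os.path.basename(transcripts) + ".transdecoder_dir"
    longest_pep = os.path.join(orf_dir, "longest_orfs.pep")
    return [
        # 1. Extract candidate long ORFs.
        ["TransDecoder.LongOrfs", "-t", transcripts],
        # 2. Search candidates against a protein database (e.g. Swiss-Prot).
        ["blastp", "-query", longest_pep, "-db", blast_db,
         "-max_target_seqs", "1", "-outfmt", "6", "-evalue", "1e-5",
         "-num_threads", str(threads), "-out", "blastp.outfmt6"],
        # 3. Search candidates for Pfam domains.
        ["hmmscan", "--cpu", str(threads), "--domtblout", "pfam.domtblout",
         pfam_hmm, longest_pep],
        # 4. Final prediction, retaining ORFs with homology support.
        ["TransDecoder.Predict", "-t", transcripts,
         "--retain_pfam_hits", "pfam.domtblout",
         "--retain_blastp_hits", "blastp.outfmt6"],
    ]


if __name__ == "__main__":
    # Hypothetical input paths; print the commands for inspection.
    for cmd in transdecoder_commands("contigs.fasta", "uniprot_sprot.fasta",
                                     "Pfam-A.hmm"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools and databases are in place.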

written 24 months ago by biofalconch

Thanks for your suggestion @biofalconch, you are right. I knew about this optional step but did not use it. I agree that I would get fewer sequences by including blastp or Pfam searches, but what about novel proteins that are not in the reference databases? That's why I did not use it before... :(

modified 24 months ago • written 24 months ago by Santiago Montero-Mendieta

"about novel proteins that are not in the reference databases"

Unless you are working with an extreme outlier, there should be something with hints of reasonable homology in current protein databases.

modified 24 months ago • written 24 months ago by genomax

One reason could be that these sequences include multiple isoforms per gene rather than only the longest one. Different splice forms of the same gene can each yield their own CDS.

written 24 months ago by Rohit

Thanks @Rohit, in both datasets there are only unique IDs, so I'm assuming that the authors kept only the longest isoform per gene before publishing the contig sequences in TSA.

written 24 months ago by Santiago Montero-Mendieta

With only unique IDs I can't be sure about isoforms: what if the transcript names were changed to unique ones during pre-processing? TSA does not mention keeping only the longest isoform per transcript. If there is a reference genome, mapping the contigs onto it with a splice-aware mapper would definitely help to check this. Otherwise, as @genomax suggested, there wouldn't be a huge difference.
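When IDs do carry grouping information, redundancy can be reduced after prediction by keeping one peptide per group. The sketch below is an illustration only: it assumes TransDecoder-style peptide IDs of the form `<transcript>.p<N>`, and the helper names (`parse_fasta`, `transcript_of`, `longest_per_group`) are made up here. TSA accessions carry no gene or isoform information, so for those the grouping would have to come from elsewhere, for example genome mapping.

```python
import re


def parse_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def transcript_of(pep_id):
    """Strip an assumed trailing '.p<N>' suffix to get the transcript ID."""
    return re.sub(r"\.p\d+$", "", pep_id)


def longest_per_group(records, key=transcript_of):
    """Keep the longest peptide per group; returns {group: (id, seq)}."""
    best = {}
    for pep_id, seq in records:
        group = key(pep_id)
        if group not in best or len(seq) > len(best[group][1]):
            best[group] = (pep_id, seq)
    return best
```

For Trinity-style IDs such as `TRINITY_DN1_c0_g1_i2.p1`, passing `key=lambda pid: re.sub(r"_i\d+\.p\d+$", "", pid)` would group by gene instead of by transcript, which collapses isoforms as well.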

modified 24 months ago • written 24 months ago by Rohit