Question

TransDecoder ORF score to filter "noise" from de novo transcriptome assembly

26 days ago

tdamiani • 0

Hello everyone,

I’m working with short-read RNA-Seq data from plants. Sequencing was performed on the BGI DNBSEQ-T7 platform (paired-end, insert size 150, 6G/sample). We sequenced mRNA from 13 species, 3 tissues/species, 3 replicates/tissue, so a lot of data :) The goal of the study is to find the genes responsible for the biosynthesis of secondary metabolites.

Transcriptomes were assembled de novo (using rnaSPAdes) for each species separately. ORFs were predicted using TransDecoder:

TransDecoder.LongOrfs -t input --output_dir transdecoder
TransDecoder.Predict -t input --output_dir transdecoder --retain_pfam_hits {input.pfam} --retain_blastp_hits {input.blastp}

The final dataset contains over 1’600’000 potential ORFs (see summary below): [image: cds overview]

I’m wondering whether I can use the ORF score from TransDecoder to filter out some noise from the data (e.g., assembly artifacts). I know filtering is always risky, but even just flagging low-score ORFs as "unreliable" would be a starting point. I plotted the distribution of ORF scores across my entire dataset: [image: ORF score distribution]

The distribution peaks at score = 12.2, but I also have over 100'000 ORFs with a score below 1. Would you consider those artifacts and/or unreliable ORFs? Or is there no way to tell from the score alone?
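
To make the idea concrete, here is a minimal sketch of the kind of flagging I have in mind. It assumes that each FASTA header written by TransDecoder carries a `score=<number>` field; the function name `flag_orfs` and the default cutoff of 1.0 are hypothetical and only for illustration, not a recommended threshold. ORFs are flagged rather than removed, so nothing is lost if the cutoff turns out to be wrong.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: every ORF header contains a "score=<number>" field.
SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


class OrfScore(NamedTuple):
    orf_id: str           # first whitespace-delimited token of the header
    score: float          # value parsed from the "score=" field
    low_confidence: bool  # True when score is below the chosen cutoff


def flag_orfs(lines: Iterable[str], threshold: float = 1.0) -> Iterator[OrfScore]:
    """Yield one record per FASTA header that has a parsable score.

    The threshold is a hypothetical cutoff; ORFs below it are only
    flagged, never dropped.
    """
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = SCORE_RE.search(line)
        if match is None:
            continue  # header without a score field is skipped
        orf_id = line[1:].split()[0]
        score = float(match.group(1))
        yield OrfScore(orf_id, score, score < threshold)
```

It could be run on an open file handle, e.g. `with open("input.transdecoder.pep") as fh: records = list(flag_orfs(fh))` (the path is a placeholder), and the resulting flags joined back to the per-species tables.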

Thanks in advance for any answer!

transdecoder assembly transcriptome • 132 views

updated 26 days ago by Ram 43k • written 26 days ago by tdamiani • 0