Hello everyone,
I’m working with short-read RNA-Seq data from plants. Sequencing was performed on the BGI DNBSEQ-T7 platform (paired-end, insert size 150, 6G/sample). We sequenced mRNA from 13 species, 3 tissues/species, 3 replicates/tissue, so a lot of data :) The goal of the study is to find genes involved in the biosynthesis of secondary metabolites.
Transcriptomes were assembled de novo (using rnaSPAdes) for each species separately. ORFs were predicted using TransDecoder:
TransDecoder.LongOrfs -t input --output_dir transdecoder
TransDecoder.Predict -t input --output_dir transdecoder --retain_pfam_hits {input.pfam} --retain_blastp_hits {input.blastp}
The final dataset contains over 1,600,000 potential ORFs (see summary below):
I’m wondering whether I can use the ORF score from TransDecoder to filter out some noise from the data (e.g., assembly artifacts). I know filtering is always risky, but even just flagging low-score ORFs as "unreliable" would be a starting point. I plotted the distribution of ORF scores across my entire dataset:
The distribution peaks at a score of 12.2, but I also have over 100,000 ORFs with a score below 1. Would you consider those to be artifacts and/or unreliable ORFs? Or is it impossible to tell from the score alone?
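To make the flagging idea concrete, here is a minimal Python sketch of what I have in mind. It assumes the FASTA headers in the TransDecoder .pep output contain fragments like "ORF type:complete len:123 (+),score=12.34", which is what I see in my files but may differ between versions. The names parse_orf_headers and OrfRecord are mine, and the cutoff of 1.0 is only the illustrative value from my question, not a recommended threshold.

```python
import re
import sys
from typing import Iterable, Iterator, NamedTuple

# Assumed header fragment: "... ORF type:complete len:123 (+),score=12.34 ..."
HEADER_RE = re.compile(
    r"ORF type:(?P<type>\S+).*?score=(?P<score>-?\d+(?:\.\d+)?)"
)


class OrfRecord(NamedTuple):
    orf_id: str      # first whitespace-delimited token of the header
    orf_type: str    # e.g. complete, internal, 5prime_partial, 3prime_partial
    score: float     # score parsed from the header
    low_score: bool  # True if score is below the chosen cutoff


def parse_orf_headers(
    lines: Iterable[str], threshold: float = 1.0
) -> Iterator[OrfRecord]:
    """Yield one record per FASTA header that carries a score.

    Nothing is discarded here: ORFs are only flagged, so the decision
    to filter can be made later. Headers without a score are skipped.
    """
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = HEADER_RE.search(line)
        if match is None:
            continue  # header does not follow the assumed format
        orf_id = line[1:].split()[0]
        score = float(match.group("score"))
        yield OrfRecord(orf_id, match.group("type"), score, score < threshold)


if __name__ == "__main__":
    # Usage: python flag_orfs.py <path to .pep file> > flagged_orfs.tsv
    with open(sys.argv[1]) as handle:
        print("orf_id\torf_type\tscore\tlow_score")
        for rec in parse_orf_headers(handle):
            print(f"{rec.orf_id}\t{rec.orf_type}\t{rec.score}\t{rec.low_score}")
```

The resulting table keeps every ORF and adds a low_score column, so I could cross-check the flagged ORFs against ORF type, Pfam/BLASTP hits or expression before deciding to remove anything.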
Thanks in advance for any answer!