Hello all,
I have been tasked with validating aberrantly expressed genes detected in Lexogen RNA-sequencing data (single end, moderate sequencing depth, run through my own pipeline) using TruSeq sequencing data (paired end, high sequencing depth) run through the GTEx (Genotype-Tissue Expression) RNA-seq pipeline. The samples run through Lexogen and TruSeq are biological replicates.
EDIT: The pipelines both include deduplication, quality control, alignment and gene quantification. For detecting aberrantly expressed genes, gene-specific thresholds are established. For example, if a gene's expression value is over its threshold, the gene is called aberrantly overexpressed. The same approach applies to underexpression.
The logic behind this, I believe, is that if genes detected as aberrant in the single-end, moderate-depth data are also detected as aberrant in the paired-end, high-depth data run through a pipeline used by the Broad Institute, we can confirm that our pipeline is adequate for detecting aberrantly expressed genes?
I know that normally one would use qPCR or ddPCR to confirm expression levels, but unfortunately I don't have that option at the moment.
What do you all think of this approach? Or, if you have validated data in this way, what was your experience? I would appreciate any comments or critiques of my assumptions or approach. I am fairly new to the field and am excited to learn from you all!
Best, VN
EDIT: Corrected my use of NextSeq as a library prep, added some more details on the analysis pipeline. Hope that helps!
You do not need any pipelines "validated" by any institute. If you follow the recommendations of established DEG tools such as DESeq2 or edgeR, you will be fine.
I am a bit confused though: you say NextSeq and TruSeq as if that were an apples-to-apples comparison. NextSeq is an Illumina sequencing platform; TruSeq is a library prep kit. If you have two independent experiments and they show essentially the same result, then you have your confirmation.
For further recommendations you should include details on what your pipeline and this "validated" one are doing, and on what exactly differs between these experiments beyond the sequencing platform and the depth.
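One rough way to put a number on "essentially the same result" is to compare the two lists of aberrant genes directly. The sketch below is only an illustration, not part of either pipeline: the function names, the gene IDs and the universe size are all made up, and it assumes both experiments were tested on the same set of genes. It reports the Jaccard index of the two lists and a hypergeometric tail probability for seeing at least that much overlap by chance.

```python
from math import comb


def jaccard(genes_a, genes_b):
    """Jaccard index of two gene lists: |A and B| / |A or B|."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_pvalue(n_universe, n_a, n_b, n_overlap):
    """P(overlap >= n_overlap) if n_a and n_b genes were picked at random
    from the same universe of n_universe tested genes (hypergeometric tail)."""
    total = comb(n_universe, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical toy input: 10 tested genes, 3 called aberrant in each experiment.
lexogen_hits = ["g1", "g2", "g3"]
truseq_hits = ["g2", "g3", "g4"]
shared = set(lexogen_hits) & set(truseq_hits)

print("Jaccard index:", jaccard(lexogen_hits, truseq_hits))
print("Overlap p-value:", overlap_pvalue(10, len(lexogen_hits), len(truseq_hits), len(shared)))
```

A small p-value only says the overlap is larger than expected by chance. It does not say the calls themselves are reliable, so it complements a proper differential expression analysis rather than replacing it.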
Sorry! I often get those mixed up. I am comparing library prep kits. When I said NextSeq, what I meant was Lexogen. I will correct my post as you suggested.
That is a very good point. I was thinking of the TruSeq data as my "truth" set when instead I should be thinking of it more along the lines of reproducibility.
I will also correct my post to include more details about the pipelines.
That (deduplication) is typically not done in RNA-seq unless you have UMIs. I would skip that.
I hope there are some statistics involved. If not, this is a terrible (no offense) approach. Fold changes without statistics are meaningless, as you have no information on how reliable they are. Genes with low counts will always give high fold changes, which are unreliable. A good statistical framework such as edgeR or DESeq2 takes this into account, plus the dispersion between biological replicates, to calculate a p-value that informs you about the reliability. Please use one of these tools and do not take a custom approach. The manuals of both tools contain extensive explanations and example code. Use them! With just fold-change thresholding you get plenty (probably hundreds) of false positives.
No offense taken at all... I have had similar thoughts... This is something that was already established before it was tasked to me haha. Now comes the issue of convincing others. Thank you for giving me some guidance though. I've learned a lot from your comments. :)
Do you still have the raw data (count matrix with raw counts)? Then I would simply do the DEG analysis with a proper framework again.
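If it helps with convincing others, here is a small simulation of why fold-change thresholding alone is risky at low counts. It is only a sketch under simplifying assumptions I picked for illustration: counts are pure Poisson noise with the same true mean in both samples (so no gene is truly different), the cutoff is |log2 fold change| >= 1, and a pseudocount of 1 avoids division by zero. The function names and parameter values are made up. Real RNA-seq data is more variable than Poisson, so this understates the problem if anything.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson count with Knuth's algorithm (fine for modest means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def false_call_rate(mean, n_genes=2000, lfc_cutoff=1.0, pseudocount=1.0, seed=0):
    """Fraction of null genes (same true mean in both samples) whose
    |log2 fold change| passes the cutoff purely by sampling noise."""
    rng = random.Random(seed)
    calls = 0
    for _ in range(n_genes):
        a = poisson(rng, mean)
        b = poisson(rng, mean)
        lfc = math.log2((a + pseudocount) / (b + pseudocount))
        if abs(lfc) >= lfc_cutoff:
            calls += 1
    return calls / n_genes


low = false_call_rate(mean=5)     # lowly expressed genes
high = false_call_rate(mean=200)  # well expressed genes
print(f"false calls at mean count 5:   {low:.3f}")
print(f"false calls at mean count 200: {high:.3f}")
```

The low-count genes cross the cutoff far more often than the high-count genes even though nothing is truly changed. That is the behaviour DESeq2 and edgeR account for by modelling count-dependent variance and dispersion.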
I'll give it a try! Thanks again for the advice. :)
No problem at all. If you need further advice, please feel free to ask. I have benefitted so much from this community and am happy to help others if I can.