I know similar topics have been posted before (e.g. "rna seq replicates not clustering"). Here, though, I want to prompt some discussion using concrete plots and examples.
I have triplicates for three conditions (Ev, Mut and WT) in mouse cells. I followed the standard workflow for RNA-seq data: alignment (STAR) -> raw counts (HTSeq, intersection-strict mode) -> TMM normalization in edgeR -> VSN to 'disassociate dependence of the variance on the mean intensity' -> use the VSN values for exploratory analysis (QC).
I did both PCA and pairwise-correlation clustering to look for outliers.
- PCA plot:
- pairwise-correlation heatmap:
From the PCA plot, one sample from each condition (wt_rep2, ev_rep1 and mut_rep1) is an obvious outlier, and the pairwise-correlation heatmap confirms this. However, the remaining samples do not clearly cluster by condition either. Yes, these data are totally messed up.
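To make the pairwise-correlation check concrete, here is a minimal sketch in plain Python. It is not my actual pipeline (that was done in R), and the expression values are synthetic toy data generated only for illustration; the function names and the noise levels are assumptions. It ranks samples by their mean Pearson correlation with all other samples, which is roughly what the heatmap shows visually.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_correlation_per_sample(expr):
    """expr maps sample name -> normalized values (same gene order in every sample).

    Returns sample name -> mean Pearson correlation with all other samples.
    """
    names = list(expr)
    result = {}
    for s in names:
        others = [pearson(expr[s], expr[t]) for t in names if t != s]
        result[s] = sum(others) / len(others)
    return result


# Synthetic toy data (NOT the real experiment): 9 samples, 200 genes.
random.seed(0)
n_genes = 200
base = [random.gauss(8, 2) for _ in range(n_genes)]
samples = [f"{c}_rep{r}" for c in ("ev", "mut", "wt") for r in (1, 2, 3)]
noisy = {"ev_rep1", "mut_rep1", "wt_rep2"}  # samples given extra noise on purpose

expr = {}
for s in samples:
    sd = 1.5 if s in noisy else 0.2
    expr[s] = [b + random.gauss(0, sd) for b in base]

mean_cor = mean_correlation_per_sample(expr)

# Print samples from least to most correlated with the rest.
for s, r in sorted(mean_cor.items(), key=lambda kv: kv[1]):
    print(f"{s}\t{r:.3f}")
```

Samples at the top of the printed list are the candidates for outliers. Note that this only ranks samples; it does not say where to draw the line between "outlier" and "normal".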
Here are the three options I am considering (with questions):
- Delete the outliers and use the remaining samples for DE analysis. Since I have only 3 replicates per condition, how could I run a DE analysis with just two samples per condition?
- Treat the outlier vs. normally clustering samples as a batch effect and incorporate batch as a covariate in the design matrix (~ condition + batch), even though I have no idea which condition-independent factors make these samples outliers. Is this valid?
- Ignore the QC and go ahead with the DE analysis. But then what is the point of doing an exploratory QC analysis?
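To spell out what the `~ condition + batch` option would look like, here is a small sketch in plain Python that builds a treatment-coded design matrix and checks that it is full rank. The batch labels ("A" for normally clustering samples, "B" for the three outliers), the choice of "ev" as reference level, and the helper names are assumptions made for illustration; in practice this would be done with `model.matrix` in R. The check only shows that the model is estimable; it does not answer whether defining a batch from the QC plots is statistically justified.

```python
def design_matrix(conditions, batches, ref_condition, ref_batch):
    """Treatment-coded design for ~ condition + batch.

    Returns (column_names, rows), one row per sample.
    """
    cond_levels = [c for c in sorted(set(conditions)) if c != ref_condition]
    batch_levels = [b for b in sorted(set(batches)) if b != ref_batch]
    cols = (["(Intercept)"]
            + [f"condition{c}" for c in cond_levels]
            + [f"batch{b}" for b in batch_levels])
    rows = []
    for c, b in zip(conditions, batches):
        row = [1.0]
        row += [1.0 if c == level else 0.0 for level in cond_levels]
        row += [1.0 if b == level else 0.0 for level in batch_levels]
        rows.append(row)
    return cols, rows


def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination."""
    m = [r[:] for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a pivot row for this column among the rows not yet used.
        pivot = next((r for r in range(rank, n_rows) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for k in range(col, n_cols):
                m[r][k] -= factor * m[rank][k]
        rank += 1
    return rank


# Sample order: ev_rep1..3, mut_rep1..3, wt_rep1..3
conditions = ["ev"] * 3 + ["mut"] * 3 + ["wt"] * 3
# Hypothetical batch labels: "B" marks ev_rep1, mut_rep1 and wt_rep2.
batches = ["B", "A", "A", "B", "A", "A", "A", "B", "A"]

cols, rows = design_matrix(conditions, batches, ref_condition="ev", ref_batch="A")
rank = matrix_rank(rows)
residual_df = len(rows) - rank

print(cols)
print("rank:", rank, "residual df:", residual_df)
```

Because every condition has samples in both "batches", the design is full rank (4 coefficients for 9 samples), so the condition effects can be estimated while adjusting for batch, at the cost of one residual degree of freedom.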