16 months ago by
Republic of Ireland
People generally inspect for outliers visually by looking at the PCA bi-plot of principal components 1 and 2 (see my post here: A: PCA in a RNA seq analysis ). For RNA-seq, a sample that has genuinely 'failed', and whose data is skewed by extraneous factors unrelated to the biological condition of interest, will typically sit at a distance of ~200 to 1,000 from the main group of samples along PC1. These samples are very easy to identify and don't usually require statistical justification.
If we do want to quantify what it is to be an outlier (To be an outlier, or not to be), we usually flag any sample that lies more than 3 standard deviations from the main group of samples along PC1. Mathematically, all that you need to do is convert your PC1 values to Z-scores (subtract the mean and divide by the standard deviation) and then check for those with |Z| > 3. In R, get the PC values by using prcomp() and then accessing the 'x' element of the returned object, e.g.,
pca <- prcomp(t(rna.data)); pca$x
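As a minimal sketch of the Z-score check described above, the following builds on the prcomp() call. The matrix here is simulated purely for illustration (1,000 genes by 20 samples, with one sample deliberately shifted to mimic a failed library); the names z and outliers are my own, and in practice you would substitute your own normalised, log-scale rna.data with genes in rows and samples in columns.

```r
# Simulated stand-in for rna.data (genes in rows, samples in columns).
# This is illustrative data only; replace it with your own matrix.
set.seed(1)
rna.data <- matrix(
  rnorm(1000 * 20), nrow = 1000, ncol = 20,
  dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:20))
)

# Shift one sample across all genes to mimic a 'failed' library
rna.data[, "sample20"] <- rna.data[, "sample20"] + 5

# prcomp() expects samples in rows, hence the transpose
pca <- prcomp(t(rna.data))

# PC1 coordinates, one value per sample (names are the sample IDs)
pc1 <- pca$x[, "PC1"]

# Convert to Z-scores: subtract the mean, divide by the standard deviation
z <- (pc1 - mean(pc1)) / sd(pc1)

# Flag samples more than 3 standard deviations from the mean along PC1
outliers <- names(z)[abs(z) > 3]
outliers
```

One thing to keep in mind with this threshold: for n samples, the largest Z-score that any single sample can reach is (n - 1) / sqrt(n), which is below 3 when n is 10 or fewer, so in small experiments the |Z| > 3 rule cannot flag anything and visual inspection is the more practical route.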
The method that you've mentioned is published in a reputable journal and is therefore justified, in my opinion. I would just ask that you check the following before using it: does the algorithm expect counts that follow a negative binomial distribution (e.g. normalised counts in edgeR or DESeq2) or data that are approximately normally distributed (logged normalised counts)?