16 months ago by

Republic of Ireland

People generally inspect for outliers *visually* by observing the PCA bi-plot for principal components 1 and 2 (see my post here: A: PCA in a RNA seq analysis ). For RNA-seq, a sample that has genuinely 'failed' and whose data is skewed due to extraneous factors unrelated to the biological condition of interest will typically be a magnitude of ~200 to 1 000 from the main group of samples along PC1 - these are very easy to identify and don't usually require statistical justification.

If we do want to quantify what it is to be an outlier (*To be an outlier, or not to be*), we usually identify any sample that falls outside the main group of samples by a magnitude (along PC1) of greater than 3 standard deviations. Mathematically, all that you need to do is convert your PC1 values to Z-scores and then check for those >|3|. In R, get these by using prcomp() and then accessing the 'x' variable of the returned object, e.g., `pca <- prcomp(t(rna.data); pca$x`

The method that you've mentioned is published in a reputable journal and therefore justified, in my opinion. I would just ask that you check the following before using it: Does the algorithm expect counts as a negative binomial distribution (e.g. normalized counts in EdgeR or DESeq2) or a binomial distribution (logged normalised counts)?

Good luck

Kevin