Automatic Outlier Detection for RNA-seq data
2
0
3.4 years ago
JJ ▴ 560

Hi all,

I am looking for an automated approach to detect outlier samples in RNA-seq data. Until now I have looked at a PCA plot and decided visually, but I would like to automate this. I have been looking at the PcaHubert() function in the rrcov package, which marks suspected outliers with flag = FALSE:

library(rrcov)
pca <- PcaHubert(t(rna.data))  # transpose so that samples are in rows
outliers <- which(!pca@flag)   # flag is FALSE for suspected outliers


Would this be a good option? Or are there others better suited for RNA-seq data?

RNA-Seq outlier rrcov • 5.3k views
8
3.4 years ago

People generally inspect for outliers visually by observing the PCA bi-plot for principal components 1 and 2 (see my post here: A: PCA in a RNA seq analysis ). For RNA-seq, a sample that has genuinely 'failed', and whose data are skewed by extraneous factors unrelated to the biological condition of interest, will typically sit ~200 to 1,000 units away from the main group of samples along PC1. Such samples are very easy to identify and don't usually require statistical justification.

If we do want to quantify what it is to be an outlier (to mis-quote Shakespeare: "To be an outlier, or not to be"), we usually flag any sample that falls more than 3 standard deviations from the main group of samples along PC1. Mathematically, all that you need to do is convert your PC1 values to Z-scores and then check for those with |Z| > 3. In R, get the PC1 values by using prcomp() and then accessing the 'x' element of the returned object, e.g., pca <- prcomp(t(rna.data)); pca$x
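
As a rough sketch of the Z-score check described above, run on simulated data rather than a real experiment: the matrix rna.data, the sample names S1 to S20, and the artificial shift applied to S7 are all made up for illustration, and the cut-off of 3 is just the rule of thumb mentioned here.

```r
# Simulated log-scale expression matrix (genes in rows, samples in columns).
# 'rna.data' is made-up data standing in for a real, normalised matrix.
set.seed(1)
rna.data <- matrix(rnorm(500 * 20), nrow = 500,
                   dimnames = list(paste0("gene", 1:500), paste0("S", 1:20)))
rna.data[, "S7"] <- rna.data[, "S7"] + 3    # shift one sample to mimic a failed library

pca <- prcomp(t(rna.data))                  # prcomp() expects samples in rows

# Convert the PC1 coordinates to Z-scores (centre, then divide by the SD)
pc1.z <- as.numeric(scale(pca$x[, "PC1"]))
names(pc1.z) <- rownames(pca$x)

# Samples lying more than 3 standard deviations from the mean along PC1
outliers <- names(pc1.z)[abs(pc1.z) > 3]
print(outliers)
```

One thing to keep in mind with this rule: a sample Z-score cannot exceed (n - 1) / sqrt(n), so with 10 samples or fewer nothing can ever pass |Z| > 3, and a lower threshold or visual inspection is needed for small experiments.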

The method that you've mentioned is published in a reputable journal and therefore justified, in my opinion. I would just ask that you check the following before using it: does the algorithm expect data that follow a negative binomial distribution (e.g. normalised counts from edgeR or DESeq2) or a normal distribution (logged normalised counts)?

Good luck

Kevin

1

Hello Kevin. I am looking for an automatic outlier detection method but haven't found one yet. Cook's distance looks good, but it may be better suited to detecting outlier genes than outlier samples. Isolation forest may be a good option, but I need to try it first. Your method may not be very suitable for patient (clinical) data, where PC1's proportion of variance may not be high, for example only around 0.5.

0

Thank you very much for your input! I used voom-transformed RSEM values, so log2CPM. Do you know if the algorithm can work with this? Thanks!

1

Hi friend,

You may want to check the distribution of the data with the hist() function in R, and then share the figure here (or just decide yourself if it's normally distributed). To share an image here, just upload it here, and then share the URL in your comment/reply.
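
Something along these lines would do for the hist() check; this is only a sketch on simulated values, and the object name logcpm, its dimensions, and the choice of 50 breaks are placeholders, not anything from your data.

```r
# Simulated stand-in for a voom-transformed (log2CPM) matrix:
# 1000 genes x 6 samples. Replace 'logcpm' with the real matrix.
set.seed(1)
logcpm <- matrix(rnorm(1000 * 6, mean = 5, sd = 2), nrow = 1000)

# Histogram of all values pooled across genes and samples
h <- hist(as.vector(logcpm), breaks = 50,
          main = "Distribution of log2CPM values", xlab = "log2CPM")

# A normal Q-Q plot is a second visual check:
# points lying close to the line suggest approximate normality
qqnorm(as.vector(logcpm))
qqline(as.vector(logcpm))
```

A roughly bell-shaped histogram and a near-straight Q-Q plot would suggest the values are close enough to normal; a long tail or a spike at low values would suggest otherwise.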

In the manual for rrcov, the package that provides PcaHubert(), they state:

These estimates are optimal if the data come from a multivariate normal distribution but are extremely sensitive to the presence of even a few outliers (atypical values, anomalous observations, gross errors) in the data.

[source: https://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf]

So, it looks like having your data normally distributed would be optimal.

0

Hello Kevin, is there any chance you remember which paper used this PC1 Z-score (|Z| > 3) method? I would like to read and/or cite it. Thanks!

0

Hey, I do not have any citations - it is just a general way to detect outliers. It would likely only appear in supplementary methods, or not at all. I think that it is okay to justify the removal of outliers by eye, too.

In most methods sections, people would simply write: "X samples were removed after visual inspection of a PCA bi-plot"

1
6 days ago
Melisa ▴ 10

I think removing outliers in that way is justified in this publication: https://iopscience.iop.org/article/10.1088/1742-6596/705/1/012003/pdf