Question: Automatic Outlier Detection for RNA-seq data
22 months ago by
JJ460
JJ460 wrote:

Hi all,

I am looking for an automated approach to detect outliers in RNA-seq data. Until now I have inspected a PCA plot and decided visually, and I would now like to automate this. I have been looking at the PcaHubert() function in the rrcov package, which sets the flag of suspected outliers to FALSE:

library(rrcov)
pca <- PcaHubert(t(rna.data))   # transpose so that samples are in rows
outliers <- which(!pca@flag)    # flag is FALSE for suspected outliers

Would this be a good option? Or are there others better suited for RNA-seq data?

Thanks for your input!

rna-seq outlier rrcov • 2.5k views
modified 22 months ago by Kevin Blighe48k • written 22 months ago by JJ460
22 months ago by
Kevin Blighe48k
Kevin Blighe48k wrote:

People generally inspect for outliers visually by looking at the PCA bi-plot for principal components 1 and 2 (see my post here: A: PCA in a RNA seq analysis). For RNA-seq, a sample that has genuinely 'failed', and whose data is skewed by extraneous factors unrelated to the biological condition of interest, will typically sit at a distance of ~200 to 1,000 from the main group of samples along PC1. These are very easy to identify and don't usually require statistical justification.

If we do want to quantify what it is to be an outlier (to misquote Shakespeare: "To be an outlier, or not to be"), we usually flag any sample that lies more than 3 standard deviations from the main group of samples along PC1. Mathematically, all that you need to do is convert your PC1 values to Z-scores and then check for those with |Z| > 3. In R, get the PC values by using prcomp() and then accessing the 'x' element of the returned object, e.g., pca <- prcomp(t(rna.data)); pca$x
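As a rough illustration of the Z-score check described above, here is a minimal sketch in base R. The simulated matrix, the sample names S1 to S30, and the deliberately shifted sample S30 are all made up for the example; with real data you would substitute your own normalised, log-scale expression matrix for rna.data.

```r
# Hypothetical data: 200 genes x 30 samples on a log2-like scale
set.seed(1)
rna.data <- matrix(rnorm(200 * 30, mean = 5), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("S", 1:30)))
rna.data[, "S30"] <- rna.data[, "S30"] + 4   # make one sample deliberately aberrant

pca <- prcomp(t(rna.data))                   # samples in rows; centred by default
pc1 <- pca$x[, "PC1"]                        # PC1 score for each sample

z <- (pc1 - mean(pc1)) / sd(pc1)             # convert PC1 scores to Z-scores
outliers <- names(z)[abs(z) > 3]             # samples more than 3 SD from the mean
print(outliers)
```

One caveat with this approach: when the standard deviation is computed from the same samples, the largest possible |Z| is (n - 1) / sqrt(n), so with 10 or fewer samples nothing can ever exceed 3 and a lower cutoff or visual inspection is needed.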

The method that you've mentioned is published in a reputable journal and is therefore justified, in my opinion. I would just ask that you check the following before using it: does the algorithm expect count data that follow a negative binomial distribution (e.g. normalized counts from edgeR or DESeq2) or data that are approximately normally distributed (logged normalised counts)?

Good luck

Kevin

modified 4 months ago • written 22 months ago by Kevin Blighe48k

Thank you very much for your input! I used voom-transformed RSEM values, i.e. log2CPM. Do you know if the algorithm can work with this? Thanks!

written 22 months ago by JJ460

Hi friend,

You may want to check the distribution of the data with the hist() function in R, and then share the figure here (or just decide yourself if it's normally distributed). To share an image here, just upload it, and then share the URL in your comment/reply.
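A minimal sketch of that check, assuming a genes x samples matrix of log2CPM values; the simulated rna.data below is only a stand-in for your own matrix. A normal Q-Q plot is added alongside hist() as a second way to judge the shape by eye.

```r
# Hypothetical stand-in for a genes x samples matrix of log2CPM values
set.seed(1)
rna.data <- matrix(rnorm(500 * 6, mean = 5, sd = 2), nrow = 500)

values <- as.vector(rna.data)                # pool all samples into one vector

# Histogram: a roughly symmetric bell shape suggests approximate normality
h <- hist(values, breaks = 50, main = "log2CPM, all samples", xlab = "log2CPM")

# Q-Q plot: points lying close to the line also suggest approximate normality
qqnorm(values)
qqline(values)
```

Plotting each sample separately (one column at a time) can also help, since a single aberrant sample may be hidden when all values are pooled.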

In the vignette for rrcov, the package that provides PcaHubert(), they state:

These estimates are optimal if the data come from a multivariate normal distribution but are extremely sensitive to the presence of even a few outliers (atypical values, anomalous observations, gross errors) in the data.

[source: https://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf]

So, it looks like having your data normally distributed would be optimal.

modified 6 months ago • written 22 months ago by Kevin Blighe48k

Hello Kevin, is there any chance you remember which paper used this PC1 Z-score |Z| > 3 method? I would like to read and/or cite it. Thanks!

written 6 months ago by manninm0

Hey, I do not have any citations - it is just a general way to detect outliers. It would likely only appear in supplementary methods, or not at all. I think that it is okay to justify the removal of outliers by eye, too.

In most cases, people would simply write: "X samples were removed after visual inspection of a PCA bi-plot"

written 6 months ago by Kevin Blighe48k