Question: Threshold to exclude samples based on sample-to-sample distances
3.0 years ago by
Tobias.Wohland60 wrote:

Hi, I have an RNA-Seq experiment and I'm wondering at what point I should exclude samples from my analysis. I couldn't find a good answer with Google's help.

I can clearly see (at least I would say so) in the sample-to-sample distance heatmap and a simple PCA plot that one sample in one of the treatment groups differs from the other samples of the same group. But is there a threshold I can use to say "yes" or "no" to excluding it?

Please find both plots attached. The point in the bottom-left corner of the PCA plot is the sample "SD_2" (see the distance plot).

[Figure: sample-to-sample distance heatmap]

[Figure: PCA plot]

Thanks for your help in advance.

Best, Tobi

rna-seq R • 1.4k views
modified 2.8 years ago by Dan Gaston7.1k • written 3.0 years ago by Tobias.Wohland60

Hi Tobias,

Did you ever figure out an answer to your problem? I'm wondering about this as well...

Best, Janne

written 2.8 years ago by Janne.Swaegers0
2.8 years ago by
Dan Gaston7.1k wrote:

Seeing this question now thanks to Janne's bumping. I don't think there is a good threshold you can use for this sort of thing that would work in all cases. Most people would simply determine that a sample is an outlier based on the visualizations and then manually remove it from their analysis. This is part of the "art" of data analysis. Machine learning approaches and clustering can achieve what you want, but it seems like overkill versus simple visualization like you have already done.
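For anyone who wants a number to look at alongside the plots, here is a minimal sketch in plain Python. It computes each sample's mean Euclidean distance to the other samples in its group and expresses it relative to the group median. Everything in it is an assumption made for illustration: the expression values are invented, the sample names other than "SD_2" are hypothetical, and the ratio is only a way to rank samples, not a validated cutoff.

```python
import math
from statistics import median


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance_to_group(profiles):
    """Map each sample name to its mean distance to the other samples.

    profiles: dict of sample name -> list of (log-scale) expression values.
    Assumes the group contains at least two samples.
    """
    result = {}
    for name, values in profiles.items():
        others = [euclidean(values, v) for n, v in profiles.items() if n != name]
        result[name] = sum(others) / len(others)
    return result


def rank_by_relative_distance(profiles):
    """Rank samples by mean distance divided by the group median of that value."""
    mean_dist = mean_distance_to_group(profiles)
    centre = median(mean_dist.values())
    ratios = {name: d / centre for name, d in mean_dist.items()}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy values for one treatment group (not the data from the question).
group = {
    "SD_1": [5.0, 7.1, 3.2, 9.0],
    "SD_2": [8.5, 3.0, 6.9, 4.1],
    "SD_3": [5.2, 6.9, 3.0, 9.2],
    "SD_4": [4.9, 7.3, 3.4, 8.8],
}

ranking = rank_by_relative_distance(group)
for name, ratio in ranking:
    print(f"{name}: {ratio:.2f}x the group median")
```

A sample that sits far above the others in this ranking is a candidate for a closer look, which is the same judgment call as reading the heatmap, just written down as a number.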

written 2.8 years ago by Dan Gaston7.1k

I agree, but where do you draw the line? When does removing outliers become hypothesis-driven data manipulation?

Given that the sample size is small (in this example and many others), you are sampling individuals from a larger population without knowing the true heterogeneity in that population.

written 2.8 years ago by WouterDeCoster40k

Absolutely agree. In RNA-Seq experiments, when I see this sort of pattern I tend to run parallel analyses with and without the sample included and look for other evidence that it is really behaving oddly. If I'm dealing with clinical samples, I (or someone else) go back to the clinical data to see if there is anything notable about the sample. Usually I find something there and can simply drop the sample from the analysis because it was mischaracterized at the phenotype level.

That said, a few points:

1) For an RNA-Seq experiment, 6 samples in a group isn't really a small number. I realize that the number of samples people run is increasing as costs drop, but this is about double the number of samples you usually have in a group.

2) That light blue group of samples is definitely more variable than the red cluster in the PCA, but even then the sample in the lower left is quite far outside the normal range of variation.

I'd think hard about removing it. In this case I would probably do the parallel analysis approach.
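As a rough illustration of comparing the two parallel runs, the sketch below takes the adjusted p-values from an analysis with the sample and one without it, then reports which genes are called significant in both, in only one, and how much the two lists overlap. The gene names, p-values, and the 0.05 cutoff are all made up for the example; the differential expression itself would come from whatever tool you already use.

```python
def significant_genes(results, alpha=0.05):
    """Return the set of genes whose adjusted p-value is below alpha.

    results: dict of gene name -> adjusted p-value.
    alpha is an assumed cutoff, chosen here only for illustration.
    """
    return {gene for gene, padj in results.items() if padj < alpha}


def compare_runs(with_sample, without_sample, alpha=0.05):
    """Summarize how the significant gene lists of two runs differ."""
    a = significant_genes(with_sample, alpha)
    b = significant_genes(without_sample, alpha)
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_with": sorted(a - b),
        "only_without": sorted(b - a),
        # Jaccard index: 1.0 means identical lists, 0.0 means no overlap.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Invented adjusted p-values for the two parallel analyses.
with_sd2 = {"geneA": 0.001, "geneB": 0.20, "geneC": 0.03, "geneD": 0.60}
without_sd2 = {"geneA": 0.0005, "geneB": 0.01, "geneC": 0.02, "geneD": 0.70}

summary = compare_runs(with_sd2, without_sd2)
print(summary)
```

If the two lists barely change, the decision about the sample matters little for the conclusions. If they change a lot, that is a reason to go looking for independent evidence about the sample before dropping it.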

written 2.8 years ago by Dan Gaston7.1k