Question: Sample outlier detection in DESeq2
gravatar for nikitavlassenko
21 months ago by
nikitavlassenko40 wrote:

I have 16 samples of bulk-RNA-seq data. There are 4 groups, so 4 samples are in each group. I want to check for outliers in each of the groups, and also get some characteristic estimate for each sample in each group of how much of an outlier it is. I am using DESeq2. There are threads that I found in Google, but I do not think they answer my question:

With this one I get a bunch of plots which all look approximately the same. abline() for some reason is not drawn for me, not sure what is the purpose of it (showing genes outliers? I am not interested in it).

If we look to the official documentation:

The assays(dds)[["cooks"]] gives the measures for outliers for each gene and sample, but I need to get it just for each sample in each group. I did a boxplot and it looks like that:


The PCA looks like that:

Which makes me think that there is an outlier in B6 group, the rightmost sample. The output from summary(res) gives the following:

What metrics to use to assess how much of an outlier each sample is in each group? I am a bit confused.

rna-seq deseq2 • 2.4k views
ADD COMMENTlink modified 21 months ago by genomax76k • written 21 months ago by nikitavlassenko40

For future reference : How to add images to a Biostars post

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax76k
gravatar for russhh
21 months ago by
UK, U. Glasgow
russhh4.9k wrote:

Don't be too quick to throw away samples. That B6 sample is no more an outlier than the ET sample on the far-left, by eye. Run PCA on subsamples: subselect genes and use leave-one-out over the samples (you mentioned 16 samples, but presented 12); for each plie of leave-one-out, check which group your held-out sample clusters with (which is the nearest other point, or which is the nearest mean point in the PCA). What values are you passing to pca: log-counts or log-fpkm or on some zero-shifted version of the counts? Have you filtered-out your poorly-detected features before running PCA? If so, was this outlier robust to your arbitrary cutpoint?

ADD COMMENTlink written 21 months ago by russhh4.9k

I neither see outliers here. There looks to be good separation between B6, ET, and GF along both PC1 and PC2, in fact. Proceed with your analysis as it is.

Cook's Distance is one of the chief metrics used to assess outliers. Going by a PCA bi-plot, one would look are various things:

  • % variation explained along axes
  • eigenvalues of each sample to each PC
  • standard deviation of each sample's eigenvalue (>3 can be assumed outlier)

Edit: you pasted your code just as I was typing. Although there are outlier genes in that sample, the sample itself is not an outlier. In addition, there are most likely outlier genes in the other samples, too. Biology doesn't follow the rules that we'd like it [to follow]. Whilst we desperately would like everything to fit neatly into our statistical models, things never do. I see minimal justification for removing any sample from your study based on the information that you've provided. If you wanted to re-do normalisation and set a harder threshold for filtering low count transcripts, then that may work.

Note Bene: it looks like you've used the PCA function that's in-built into DESeq2. That function automatically removes a large proportion of your genes based on low variance without even informing the user (it should inform because all users would naturally assume that all transcripts are being used). If you wanted to perform an entirely unbiased PCA analysis, then use my code that utilises base R functions: A: PCA plot from read count matrix from RNA-Seq (start from your regularised log counts)

ADD REPLYlink modified 21 months ago • written 21 months ago by Kevin Blighe53k

Sorry Kevin

We have raw read counts from sequencing using a technique that only sequence 1402 immune related genes (HTG)

I used DESeq2 and I am seeing my samples are too disperse

enter image description here

And as you are seeing in this link I am not able to separate genes based on the differential expression at all

What one can do in such a samples please ?

ADD REPLYlink written 3 months ago by A3.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1646 users visited in the last hour