Question: Sample outlier detection in DESeq2
0
gravatar for nikitavlassenko
11 months ago by
nikitavlassenko10 wrote:

I have 16 samples of bulk-RNA-seq data. There are 4 groups, so 4 samples are in each group. I want to check for outliers in each of the groups, and also get some characteristic estimate for each sample in each group of how much of an outlier it is. I am using DESeq2. There are threads that I found in Google, but I do not think they answer my question:

https://support.bioconductor.org/p/35918/

With this one I get a bunch of plots which all look approximately the same. abline() for some reason is not drawn for me, not sure what is the purpose of it (showing genes outliers? I am not interested in it).

If we look to the official documentation:

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

The assays(dds)[["cooks"]] gives the measures for outliers for each gene and sample, but I need to get it just for each sample in each group. I did a boxplot and it looks like that:

boxplot

The PCA looks like that:
pca

Which makes me think that there is an outlier in B6 group, the rightmost sample. The output from summary(res) gives the following:
summary

What metrics to use to assess how much of an outlier each sample is in each group? I am a bit confused.

rna-seq deseq2 • 1.2k views
ADD COMMENTlink modified 11 months ago by genomax63k • written 11 months ago by nikitavlassenko10
1

For future reference : How to add images to a Biostars post

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax63k
1
gravatar for russhh
11 months ago by
russhh4.2k
UK, U. Glasgow
russhh4.2k wrote:

Don't be too quick to throw away samples. That B6 sample is no more an outlier than the ET sample on the far-left, by eye. Run PCA on subsamples: subselect genes and use leave-one-out over the samples (you mentioned 16 samples, but presented 12); for each plie of leave-one-out, check which group your held-out sample clusters with (which is the nearest other point, or which is the nearest mean point in the PCA). What values are you passing to pca: log-counts or log-fpkm or on some zero-shifted version of the counts? Have you filtered-out your poorly-detected features before running PCA? If so, was this outlier robust to your arbitrary cutpoint?

ADD COMMENTlink written 11 months ago by russhh4.2k
1

I neither see outliers here. There looks to be good separation between B6, ET, and GF along both PC1 and PC2, in fact. Proceed with your analysis as it is.

Cook's Distance is one of the chief metrics used to assess outliers. Going by a PCA bi-plot, one would look are various things:

  • % variation explained along axes
  • eigenvalues of each sample to each PC
  • standard deviation of each sample's eigenvalue (>3 can be assumed outlier)

Edit: you pasted your code just as I was typing. Although there are outlier genes in that sample, the sample itself is not an outlier. In addition, there are most likely outlier genes in the other samples, too. Biology doesn't follow the rules that we'd like it [to follow]. Whilst we desperately would like everything to fit neatly into our statistical models, things never do. I see minimal justification for removing any sample from your study based on the information that you've provided. If you wanted to re-do normalisation and set a harder threshold for filtering low count transcripts, then that may work.

Note Bene: it looks like you've used the PCA function that's in-built into DESeq2. That function automatically removes a large proportion of your genes based on low variance without even informing the user (it should inform because all users would naturally assume that all transcripts are being used). If you wanted to perform an entirely unbiased PCA analysis, then use my code that utilises base R functions: A: PCA plot from read count matrix from RNA-Seq (start from your regularised log counts)

ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1612 users visited in the last hour