Question: DESeq2 Appropriate Settings for Poorly Clustering Samples?
gravatar for Lindsay865
14 months ago by
United Kingdom / Newry
Lindsay86520 wrote:

Hello all,

Apologies in advance for my lack of expertise and knowledge as I am a student bioinformatician. I have tried to understand other answers as well as the DESeq2 vignette and publications already so I have managed to run it successfully, I just have a problem with choosing the settings for my particular experiment.

At the moment I have RNA-seq data (ENSG IDs and raw counts) for leukaemia patients in 4 different primary cytogenetic groups inv(16) (n=17), t(8;21) (n=27), MLL (n=29) and Normal (n=19) - unfortunately I only have one tumour sample per patient and no matching normal tissue data. There is a lot of metadata on these patients - they are of different genders/races/ethnicities/FAB categories/protocols and some have other mutations involved. One variable which I am expected to use to form groups is Event status - I am expected to use 'Censored' patients as the control group and compare 'Relapsed' patients and patients who 'Failed Induction Therapy' to this group.

I have been working on the inv(16) patients first in which there are 5 Censored and 10 Relapsed patients. Two patients were removed as they were extreme outliers when viewed by PCA plot and sample distances heatmap. So here is my code so far:

#Setting up colData object
#Previously identified outliers C10 (4) and D1 (17) removed

#Setting up countData object
dds<-DESeqDataSetFromMatrix(inv16DF[,-c(4,17)], colData=metadatainv16, design=~Event)

#Removing genes with sum total of 5 reads

#Running DESeq2
dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)

#Results in order of padj/how many genes < 0.05
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)

#Merge with normalised counts and subset padj < 0.05
resdata<-merge(,, normalized=TRUE)), 
resdatainv16sig<-subset(resdata, resdata$padj<0.05)

#P value histogram
hist(resdata$pvalue, breaks=40, col="skyblue", xlab="P value", main="Histogram of P values for inv(16)")

#Dispersion plot
plotDispEsts(dds, main="Dispersion Plot inv(16)",ylim=c(1e-6, 1e1))

#Regularised log transformation for heatmap and PCA

#PCA ntd
plotPCA(ntd, intgroup="Event")

#PCA rld
plotPCA(rld, intgroup="Event")

#Sample Distance Heatmap

colours <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
         main="Sample-to-sample distances inv(16)")

#MA Plot
plotMA(resinv16, main="MA-plot for Event with inv(16)", ylim=range(resinv16$log2FoldChange, na.rm=TRUE))

#Count plot for gene with minimum padj
plotCounts(dds, gene=which.min(resinv16$padj), intgroup="Event")

Which gives the following outputs:

> summary(resinv16)

out of 28453 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 9, 0.032% 
LFC < 0 (down)   : 28, 0.098% 
outliers [1]     : 680, 2.4% 
low counts [2]   : 7463, 26% 
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> resinv16[order(resinv16$padj),]

log2 fold change (MAP): Event Relapse vs Censored 
Wald test p-value: Event Relapse vs Censored 
DataFrame with 28548 rows and 6 columns
                 baseMean log2FoldChange     lfcSE       stat       pvalue        padj
                <numeric>      <numeric> <numeric>  <numeric>    <numeric>   <numeric>
ENSG00000159189 129.04193      -2.959093 0.6150973  -4.810772 1.503483e-06 0.007669642
ENSG00000160791  89.11385      -2.126363 0.4185705  -5.080059 3.773175e-07 0.007669642
ENSG00000232422  59.67893      -2.824722 0.5868172  -4.813633 1.482112e-06 0.007669642
ENSG00000264063  40.50715      -2.257767 0.4602950  -4.905043 9.340686e-07 0.007669642
ENSG00000234699  13.63526      -2.632995 0.5546159  -4.747420 2.060274e-06 0.008407977
...                   ...            ...       ...        ...          ...         ...
ENSG00000267778 0.8509069     -0.1971960 0.4453060 -0.4428325    0.6578869          NA
ENSG00000267785 0.6669036     -0.4189023 0.4848863 -0.8639187    0.3876326          NA
ENSG00000267791 0.6508162      0.0626245 0.4256238  0.1471358    0.8830248          NA
ENSG00000267793 4.9778852     -0.6359067 0.5809850 -1.0945320    0.2737217          NA
ENSG00000267799 4.6509096      0.7252028 0.5571731  1.3015754    0.1930616          NA

and these graphical outputs:

I have a few questions despite trying my best to learn how to optimise the settings as my results seem to be quite poor.

1: I am unsure how to tell if variables (gender/race/ethnicity/etc.) significantly contribute to my results. I could possibly include them in design= to account for variation here. Can't see any clear connection between these variables and how the patients cluster or appear in PCA plots.

2: The patients don't cluster at all, which may be related to my first point, and I think this is a major problem when trying to find DEGs, however I am still expected to group patients according to event status despite highlighting this issue. I've tried many combinations of samples and I've tried to edit the design, eg. design=~Gender+Race+Ethnicity+Event or different combinations of this.The PCA plot and clustering within the sample distances heatmap clustering does not seem to improve upon choosing only individuals of the same race and ethnicity. One thought I had was to manually pick out samples from the PCA plot which look similar, for eg. 3 Censored Patients and 3 Relapsed patients, although I know this has negative implications for the power of the analysis and would probably be majorly biased. It did produce more DEGs, and better clustering. Eg:

> summary(resinv16)

out of 25289 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 3337, 13% 
LFC < 0 (down)   : 3239, 13% 
outliers [1]     : 256, 1% 
low counts [2]   : 4903, 19% 
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

and produce the following graphical outputs:

I don't know how to approach this analysis. Could anyone give me advice on this? Are the patients in the Censored/Relapsed/Induction Failure groups considered biological replicates?

3: I have tried minReplicatesforReplace=Inf and cooksCutoff=FALSE when I observed hundreds and thousands of outliers - is this correct? I have also made boxplots with these settings and found no other major outlying samples so assume I don't need to remove samples due to this. This is the code:

boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2, main="Cook's Distances inv(16)")

and boxplot:

4: Are the settings appropriate for this analysis and amount of samples?

I apologise once again for any redundant questions I am asking and a massive thanks in advance for any help given!

dge analysis rna-seq deseq2 R • 1.5k views
ADD COMMENTlink modified 14 months ago by Kevin Blighe48k • written 14 months ago by Lindsay86520
gravatar for Kevin Blighe
14 months ago by
Kevin Blighe48k
Kevin Blighe48k wrote:

Hey Lindsay865,

I will do my utmost to cover all hints and tips that could help you to better understand your data.

At first glance, I notice that one of your choices for design formula is a potential issue, i.e., design=~Gender+Race+Ethnicity+Event. For RNA-seq data normalisation and statistical inference, one does not want to 'over' adjust with a design formula that has too many covariates. From what I understand of the biology of gene expression, race and ethnicity have minimal influence on this; thus, you should not include them in your design formula unless your specific aim is to study the effects of race or ethnicity on gene expression in your samples. Gender, you have some justification for leaving in the design formula, as genes from chromosome X and Y will obviously differ between males and females. Thus, your design formula should be at most:


...and minimally:


When normalising and then performing the statistical comparisons, you should now be using the lfcShrink function:

dds <- DESeq(dds)
res <- results(dds, contrast=c("Event", "Relapse", "Censored"), independentFiltering=TRUE, alpha=0.1, pAdjustMethod="BH", parallel=TRUE)
res <- lfcShrink(dds, contrast=c("Event", "Relapse", "Censored"), res=res)

NB - If your test is 'apeglm', then you cannot specify a contrast and must specify a coefficient, e.g.:

res <- results(dds, name="Event_Relapset_vs_Censored")
res <- lfcShrink(dds, coef="Event_Relapset_vs_Censored", type="apeglm", res=res)

When doing this, betaPrior's default to the DESeq() function in the latest versions of DESeq2 is betaPrior=FALSE. If you want to go back to the original way of generating results without using lfcshrink(), then use DESeq2(..., betaPrior=TRUE) and ignore the use of lfcshrink()

When you do the variance-stabilised or regularised log transformation, note the extra parameter called 'blind'. For example, if we use rlog(dds, blind=TRUE), then the transformation is performed independent of the design formula. If blind=FALSE, then the transformation will be adjusted based on your design formula. According to Michael Love, for the purposes of general QC of data, like via PCA, then blind should be TRUE; however, when producing transformed counts for downstream functions, one should use blind=FALSE. Note the difference yourself by using both of these and then re-generating the PCA bi-plots.

This point then ties in with how one can gauge whether or not a particular variable is biasing your counts (your Point 1). If, after you transform your data and have used blind=TRUE, you notice that some variable is biasing your samples via inspection of the PCA bi-plot, you should then go back and include that variable in the design formula. Looking at your provided PCA bi-plots, I don't see anything out of the ordinary - there are certainly no sample outliers.

Samples should not always cluster the way that one would expect via PCA. In fact, by removing samples because one is not happy with the way that the samples appear on the PCA bi-plot, you could be accused of cherry-picking samples / data in order to fit what you want to see. This ties in with your Point 2. If there is no evidence of an outlier in your initial PCA bi-plot, then you don't really have justification for removing samples. You could explore other things, like potential sample mix-ups (which occurs frequently), or indeed if some covariate is biasing your samples (your Point 1).

Your Point 3. From my experience, it can be 'risky' switching off Cook's Distance, and what I find is that genes usually fail Cook's Distance due to an overly complex design formula, a high variance in the particular gene, or low counts. In 'every' experiment, some genes will fail Cook's Distance and be marked as outliers, a situation which may be exaggerated with low sample n, as you have.

You may consider raising the threshold for low counts when running:


Certainly not a final answer and other opinions welcome.


ADD COMMENTlink modified 7 months ago • written 14 months ago by Kevin Blighe48k

Hi Kevin,

Your advice has been incredibly helpful - thank you for not being harsh with your criticism! (I expected to be slated...).

So just for clarification, you're suggesting that I completely remove my:

dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)

and use the 3 lines of code you mentioned above (which includes the lfcShrink function) if I am obtaining lists of DEGs to use for gene set enrichment, etc? Do I also use this method before plotting dispersion plot and MA plots or transforming the data (rlogTransformation(dds)) and plotting PCA and sample distances heatmap? Apologies if I'm not totally following you there.. I hadn't even considered lfcShrink beforehand.

I've tried your 3 lines of code and then used both blind=TRUE and blind=FALSE while plotting my rlogTransformation(dds) data and didn't notice much difference in how the samples appeared so I don't think there is reason to include an additional variable (apart from Event and perhaps Gender also) in my design formula.

In terms of increasing the threshold for low counts, I increased it from a row sum of 5 (leaving 28548 genes) to a row sum of 50 (leaving 22123 genes) which decreased outliers from 680 to 410..this might have been a high threshold. However, I did find posts such as which make me think it isn't really necessary?

Thanks so much for your time, it really is appreciated!


ADD REPLYlink modified 14 months ago • written 14 months ago by Lindsay86520

Great - I'm glad.

Yes, I would just use the lines that I provided, i.e., the ones that include lfcShrink(). The betaPrior parameter's default should be betaPrior=FALSE for the DESeq() function itself.

The dispersion and MA plots are based on the normalised counts prior to any transformation, so, they will always remain the same irrespective of the design formula.

If you have heard otherwise regarding the filtering of low counts, in particular if it has come from Michael Love (DESeq 'Mastermind'), then I would always regard what he says as the absolute truth.

ADD REPLYlink modified 14 months ago • written 14 months ago by Kevin Blighe48k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1787 users visited in the last hour