Question: DESeq2 Appropriate Settings for Poorly Clustering Samples?
gravatar for Lindsay865
2.5 years ago by
United Kingdom / Newry
Lindsay86560 wrote:

Hello all,

Apologies in advance for my lack of expertise and knowledge as I am a student bioinformatician. I have tried to understand other answers as well as the DESeq2 vignette and publications already so I have managed to run it successfully, I just have a problem with choosing the settings for my particular experiment.

At the moment I have RNA-seq data (ENSG IDs and raw counts) for leukaemia patients in 4 different primary cytogenetic groups inv(16) (n=17), t(8;21) (n=27), MLL (n=29) and Normal (n=19) - unfortunately I only have one tumour sample per patient and no matching normal tissue data. There is a lot of metadata on these patients - they are of different genders/races/ethnicities/FAB categories/protocols and some have other mutations involved. One variable which I am expected to use to form groups is Event status - I am expected to use 'Censored' patients as the control group and compare 'Relapsed' patients and patients who 'Failed Induction Therapy' to this group.

I have been working on the inv(16) patients first in which there are 5 Censored and 10 Relapsed patients. Two patients were removed as they were extreme outliers when viewed by PCA plot and sample distances heatmap. So here is my code so far:

#Setting up colData object
#Previously identified outliers C10 (4) and D1 (17) removed

#Setting up countData object
dds<-DESeqDataSetFromMatrix(inv16DF[,-c(4,17)], colData=metadatainv16, design=~Event)

#Removing genes with sum total of 5 reads

#Running DESeq2
dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)

#Results in order of padj/how many genes < 0.05
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)

#Merge with normalised counts and subset padj < 0.05
resdata<-merge(,, normalized=TRUE)), 
resdatainv16sig<-subset(resdata, resdata$padj<0.05)

#P value histogram
hist(resdata$pvalue, breaks=40, col="skyblue", xlab="P value", main="Histogram of P values for inv(16)")

#Dispersion plot
plotDispEsts(dds, main="Dispersion Plot inv(16)",ylim=c(1e-6, 1e1))

#Regularised log transformation for heatmap and PCA

#PCA ntd
plotPCA(ntd, intgroup="Event")

#PCA rld
plotPCA(rld, intgroup="Event")

#Sample Distance Heatmap

colours <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
         main="Sample-to-sample distances inv(16)")

#MA Plot
plotMA(resinv16, main="MA-plot for Event with inv(16)", ylim=range(resinv16$log2FoldChange, na.rm=TRUE))

#Count plot for gene with minimum padj
plotCounts(dds, gene=which.min(resinv16$padj), intgroup="Event")

Which gives the following outputs:

> summary(resinv16)

out of 28453 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 9, 0.032% 
LFC < 0 (down)   : 28, 0.098% 
outliers [1]     : 680, 2.4% 
low counts [2]   : 7463, 26% 
(mean count < 5)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> resinv16[order(resinv16$padj),]

log2 fold change (MAP): Event Relapse vs Censored 
Wald test p-value: Event Relapse vs Censored 
DataFrame with 28548 rows and 6 columns
                 baseMean log2FoldChange     lfcSE       stat       pvalue        padj
                <numeric>      <numeric> <numeric>  <numeric>    <numeric>   <numeric>
ENSG00000159189 129.04193      -2.959093 0.6150973  -4.810772 1.503483e-06 0.007669642
ENSG00000160791  89.11385      -2.126363 0.4185705  -5.080059 3.773175e-07 0.007669642
ENSG00000232422  59.67893      -2.824722 0.5868172  -4.813633 1.482112e-06 0.007669642
ENSG00000264063  40.50715      -2.257767 0.4602950  -4.905043 9.340686e-07 0.007669642
ENSG00000234699  13.63526      -2.632995 0.5546159  -4.747420 2.060274e-06 0.008407977
...                   ...            ...       ...        ...          ...         ...
ENSG00000267778 0.8509069     -0.1971960 0.4453060 -0.4428325    0.6578869          NA
ENSG00000267785 0.6669036     -0.4189023 0.4848863 -0.8639187    0.3876326          NA
ENSG00000267791 0.6508162      0.0626245 0.4256238  0.1471358    0.8830248          NA
ENSG00000267793 4.9778852     -0.6359067 0.5809850 -1.0945320    0.2737217          NA
ENSG00000267799 4.6509096      0.7252028 0.5571731  1.3015754    0.1930616          NA

and these graphical outputs:

I have a few questions despite trying my best to learn how to optimise the settings as my results seem to be quite poor.

1: I am unsure how to tell if variables (gender/race/ethnicity/etc.) significantly contribute to my results. I could possibly include them in design= to account for variation here. Can't see any clear connection between these variables and how the patients cluster or appear in PCA plots.

2: The patients don't cluster at all, which may be related to my first point, and I think this is a major problem when trying to find DEGs, however I am still expected to group patients according to event status despite highlighting this issue. I've tried many combinations of samples and I've tried to edit the design, eg. design=~Gender+Race+Ethnicity+Event or different combinations of this.The PCA plot and clustering within the sample distances heatmap clustering does not seem to improve upon choosing only individuals of the same race and ethnicity. One thought I had was to manually pick out samples from the PCA plot which look similar, for eg. 3 Censored Patients and 3 Relapsed patients, although I know this has negative implications for the power of the analysis and would probably be majorly biased. It did produce more DEGs, and better clustering. Eg:

> summary(resinv16)

out of 25289 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 3337, 13% 
LFC < 0 (down)   : 3239, 13% 
outliers [1]     : 256, 1% 
low counts [2]   : 4903, 19% 
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

and produce the following graphical outputs:

I don't know how to approach this analysis. Could anyone give me advice on this? Are the patients in the Censored/Relapsed/Induction Failure groups considered biological replicates?

3: I have tried minReplicatesforReplace=Inf and cooksCutoff=FALSE when I observed hundreds and thousands of outliers - is this correct? I have also made boxplots with these settings and found no other major outlying samples so assume I don't need to remove samples due to this. This is the code:

boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2, main="Cook's Distances inv(16)")

and boxplot:

4: Are the settings appropriate for this analysis and amount of samples?

I apologise once again for any redundant questions I am asking and a massive thanks in advance for any help given!

dge analysis rna-seq deseq2 R • 3.0k views
ADD COMMENTlink modified 2.5 years ago by Kevin Blighe69k • written 2.5 years ago by Lindsay86560
gravatar for Kevin Blighe
2.5 years ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Hey Lindsay865,

I will do my utmost to cover all hints and tips that could help you to better understand your data.

At first glance, I notice that one of your choices for design formula is a potential issue, i.e., design=~Gender+Race+Ethnicity+Event. For RNA-seq data normalisation and statistical inference, one does not want to 'over' adjust with a design formula that has too many covariates. From what I understand of the biology of gene expression, race and ethnicity have minimal influence on this; thus, you should not include them in your design formula unless your specific aim is to study the effects of race or ethnicity on gene expression in your samples. Gender, you have some justification for leaving in the design formula, as genes from chromosome X and Y will obviously differ between males and females. Thus, your design formula should be at most:


...and minimally:


When normalising and then performing the statistical comparisons, you should now be using the lfcShrink function:

dds <- DESeq(dds)
res <- results(dds, contrast=c("Event", "Relapse", "Censored"), independentFiltering=TRUE, alpha=0.1, pAdjustMethod="BH", parallel=TRUE)
res <- lfcShrink(dds, contrast=c("Event", "Relapse", "Censored"), res=res)

NB - If your test is 'apeglm', then you cannot specify a contrast and must specify a coefficient, e.g.:

res <- results(dds, name="Event_Relapset_vs_Censored")
res <- lfcShrink(dds, coef="Event_Relapset_vs_Censored", type="apeglm", res=res)

When doing this, betaPrior's default to the DESeq() function in the latest versions of DESeq2 is betaPrior=FALSE. If you want to go back to the original way of generating results without using lfcshrink(), then use DESeq2(..., betaPrior=TRUE) and ignore the use of lfcshrink()

When you do the variance-stabilised or regularised log transformation, note the extra parameter called 'blind'. For example, if we use rlog(dds, blind = TRUE), then the transformation is performed independent of the design formula. If blind = FALSE, then the transformation will be exposed to your design formula. According to Michael Love, for the purposes of general QC of data, like via PCA, then blind should be TRUE; however, when producing transformed expression levels for downstream functions, one should use blind = FALSE. Note the difference yourself by using both of these and then re-generating the PCA bi-plots. See here for further details:

This point then ties in with how one can gauge whether or not a particular variable is biasing your counts (your Point 1). If, after you transform your data and have used blind = TRUE, you notice that some variable is biasing your samples via inspection of the PCA bi-plot, you may then go back and include that variable in the design formula in order to control for it. Further, stronger lines of evidence of such a variable being a potential biasing factor should also be sought. Looking at your provided PCA bi-plots, I don't see anything out of the ordinary - there are certainly no sample outliers.

Samples should not always cluster the way that one would expect via PCA. In fact, by removing samples because one is not happy with the way that the samples appear on the PCA bi-plot, you could be accused of cherry-picking samples / data in order to fit what you want to see. This ties in with your Point 2. If there is no evidence of an outlier in your initial PCA bi-plot, then you don't really have justification for removing samples. You could explore other things, like potential sample mix-ups (which occurs frequently), or indeed if some covariate is biasing your samples (your Point 1).

Your Point 3. From my experience, it can be 'risky' switching off Cook's Distance, and what I find is that genes usually fail Cook's Distance due to an overly complex design formula, a high variance in the particular gene, or low counts. In 'every' experiment, some genes will fail Cook's Distance and be marked as outliers, a situation which may be exaggerated with low sample n, as you have.

You may consider raising the threshold for low counts when running:


Certainly not a final answer and other opinions welcome.


ADD COMMENTlink modified 4 months ago • written 2.5 years ago by Kevin Blighe69k

Hi Kevin,

Your advice has been incredibly helpful - thank you for not being harsh with your criticism! (I expected to be slated...).

So just for clarification, you're suggesting that I completely remove my:

dds<-DESeq(dds, fitType="parametric", betaPrior=TRUE, minReplicatesForReplace=7)
resinv16<-results(dds, cooksCutoff=TRUE, independentFiltering=TRUE, alpha=0.05)

and use the 3 lines of code you mentioned above (which includes the lfcShrink function) if I am obtaining lists of DEGs to use for gene set enrichment, etc? Do I also use this method before plotting dispersion plot and MA plots or transforming the data (rlogTransformation(dds)) and plotting PCA and sample distances heatmap? Apologies if I'm not totally following you there.. I hadn't even considered lfcShrink beforehand.

I've tried your 3 lines of code and then used both blind=TRUE and blind=FALSE while plotting my rlogTransformation(dds) data and didn't notice much difference in how the samples appeared so I don't think there is reason to include an additional variable (apart from Event and perhaps Gender also) in my design formula.

In terms of increasing the threshold for low counts, I increased it from a row sum of 5 (leaving 28548 genes) to a row sum of 50 (leaving 22123 genes) which decreased outliers from 680 to 410..this might have been a high threshold. However, I did find posts such as which make me think it isn't really necessary?

Thanks so much for your time, it really is appreciated!


ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Lindsay86560

Great - I'm glad.

Yes, I would just use the lines that I provided, i.e., the ones that include lfcShrink(). The betaPrior parameter's default should be betaPrior=FALSE for the DESeq() function itself.

The dispersion and MA plots are based on the normalised counts prior to any transformation, so, they will always remain the same irrespective of the design formula.

If you have heard otherwise regarding the filtering of low counts, in particular if it has come from Michael Love (DESeq 'Mastermind'), then I would always regard what he says as the absolute truth.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Kevin Blighe69k

@Kevin Blighe: Why are you so awesome! Thank you very much!

ADD REPLYlink written 8 months ago by bioinfouser70

Just doing my best and working hard to help others - that's all that can be asked of us.

ADD REPLYlink written 8 months ago by Kevin Blighe69k

Note that the blind = FALSE procedure is not 'fool proof' and will likely not get rid of any major batch effect in your data. If you need to directly modify the transformed expression levels for downstream analyses, then you should use removeBatchEffect(), as to which I outline here: A: Trying to understand the maths behind one Limma function

ADD REPLYlink written 8 months ago by Kevin Blighe69k

@Kevin Blighe: Just asking a bit naive question then. Apart from batch effect, when I use Deseq2 to model many other covariates(like age, sex) and do differential expression analysis, are those transformed expression values usable to make boxplot for a particular gene of interest? Or one needs to do further processing, like using removeBatchEffect(), to use those transformed expression values made by Deseq2?

ADD REPLYlink written 8 months ago by bioinfouser70

Hello again,

If, for example, we have a model formula like this:

 ~ condition + sex + age^2

Then, if we obtain test statistics for the condition term, these test statistics will be adjusted for age^2 and sex. Same as if we wanted to perform a DE analysis on sex, the test stats would be adjusted for condition and age^2.

If you then perform rlog (or vst) with blind = TRUE (transformation is blind to the formula that you used), the transformation can be regarded as being performed on the normalised counts 'as is'. For QC purposes, box-and-whisker plots, scatter plots, heatmaps, PCA, etc, are then typically performed on these transformed expression levels.

If you want to use your data for other stuff, like, k-means clustering, downstream regression modeling, etc., then transform via blind = FALSE (the transformation sees and utilises the formula that you used), which can sometimes act as a soft form of adjustment for variables in your design. Ultimately, what its doing is controlling for dispersion. Obviously, if it completely adjusted the data, then we'd eliminate the condition effect, too, which we don't want. I checked again the vignette and it is explained there (under heading Blind dispersion estimation): Analyzing RNA-seq data with DESeq2

Each dataset is also different, so, there's no way to provide a 'catch all' answer. I would be checking PCA bi-plots with and without blind TRUE|FALSE, and then, if a large batch effect exists, that can be removed via removeBatchEffect() performed on the vst or rlog counts. In certain other cases, it may even be more appropriate to stratify the entire experiment and perform analyses separately,, e.g., if the study comprised data from completely disparate tissues with different transcriptional programs.

For things like age and sex, though, I'd say that simply using blind = FALSE is fine, in terms of producing data for downstream box-and-whisker / scatter plots.

ADD REPLYlink written 8 months ago by Kevin Blighe69k

Thank you, Kevin! I immensely appreciate your detailed explanation! Also, I found in the Vignette:

Do normalized counts correct for variables in the design? No. The design variables are not used when estimating the size factors, and counts(dds, normalized=TRUE) is providing counts scaled by size or normalization factors. The design is only used when estimating dispersion and log2 fold changes.

The only case in which there is more than size factor scaling on the counts is when either normalization factors have been provided (e.g. from cqn or EDASeq), or if tximport is used and the upstream software corrected for various technical biases (e.g. Salmon quantification with GC bias correction). In this case, the average transcript length is taken into account when scaling the counts with counts(dds, normalized=TRUE). For details, see the tximport package vignette and citation [@Soneson2015].

ADD REPLYlink modified 8 months ago • written 8 months ago by bioinfouser70

Dear kevin, one small question.

Does your package "pcatools" allow taking the deseq2 object directly to make pca plots? Or one needs to some steps? I find your package fascinating and want to use it, that's why wondering.

ADD REPLYlink written 8 months ago by bioinfouser70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1476 users visited in the last hour