Question: DESeq2: Continuous values
2.9 years ago, VHahaut (1.1k) wrote:


I am currently analyzing a dataset containing the following DESeq2 design:

sample    group    continuous_value
sample_a    A    35
sample_b    A    10
sample_c    B    2
sample_d    B    5

design(Experiment) <- formula(~ continuous_value + group)

Each sample belongs to a group containing 5 individuals: Group A contains the WT samples and Group B the knockdown samples.

Each sample has an associated continuous value (a percentage): the fraction of cells in that sample that are the ones I'm interested in. In other words, every sample contains cells of the same cell type, but only x% of them have the phenotype I want to analyse. Since the percentage of cells of interest varies from one sample to another, I would like to normalize the results accordingly.

My question is: how does DESeq2 handle these continuous values? Is this design the most appropriate?

I am afraid I do not fully understand the part of the DESeq2 vignette that covers this.

I already tested three approaches: 

  • Including this % in the design
  • Leaving this % out of the design
  • Transforming the % into a small number of bins, as advised in the vignette. Unfortunately, I got the error: "Error in DESeqDataSet(se, design = design, ignoreRank) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed". Moreover, we currently have no biological information that would let us cluster those % into groups, so the cut-offs between the groups are arbitrary.
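
One possible cause of the "not full rank" error in the third approach can be reproduced with base R alone. The sketch below uses a made-up sample sheet (the percentages, the 9% cut-off and the column names `pct` and `pct_bin` are assumptions for illustration, not the poster's data). If every WT sample falls into one bin and every knockdown sample into the other, the bin column carries the same information as `group`, and the design matrix loses a rank.

```r
# Hypothetical sample sheet: 5 WT (A) and 5 knockdown (B) samples.
# The percentages are invented for illustration only.
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  pct   = c(35, 10, 28, 40, 22, 2, 5, 3, 8, 6)
)

# Arbitrary cut-off at 9%: every A sample lands in "high", every B in "low".
coldata$pct_bin <- cut(coldata$pct, breaks = c(0, 9, 100),
                       labels = c("low", "high"))

# Design matrices for the binned and the numeric version of the covariate.
mm_binned  <- model.matrix(~ pct_bin + group, data = coldata)
mm_numeric <- model.matrix(~ pct + group, data = coldata)

# A design is full rank when its rank equals its number of columns.
rank_binned  <- qr(mm_binned)$rank
rank_numeric <- qr(mm_numeric)$rank
cat("binned: rank", rank_binned, "with", ncol(mm_binned), "columns\n")
cat("numeric: rank", rank_numeric, "with", ncol(mm_numeric), "columns\n")
```

Running the same `model.matrix()` and `qr()` check on the real sample table, before building the DESeqDataSet, should show whether a given binning is confounded with `group`.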


Thanks in advance for your answers!


design deseq2 R • 1.5k views

I strongly encourage you to find a local collaborator. It'll take some playing around with the data for someone to come up with an optimal solution.

When you have a significant nuisance covariate like this you pretty much have to include it somehow in the design for the results to be useful, so the second test you tried can be ignored. It's often the case that creating groups like you tried in test 3 is the simplest route, though as you've noticed you have to be fairly familiar with how the underlying statistics work to not have these groups confound the calculation of the group-effect that you actually care about. Sometimes it turns out that a simple transformation of the continuous values (e.g., with log2) provides more reasonable results, but again you really need someone familiar with messier designs like this to directly work with the data. In an ideal world, he/she can then tell you what was tried and why/how the best design was arrived at (since you'll learn a LOT from that process).

BTW, make some PCA plots and see how things group according to the covariate. Sometimes that's enough to figure out how to handle things.
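
A minimal sketch of that PCA check, assuming simulated counts and made-up percentages (nothing here comes from the poster's data). It uses base R's `prcomp()` on log2 counts as a stand-in; with a real DESeqDataSet one would more likely use DESeq2's `vst()` followed by `plotPCA()`. Instead of drawing the plot, the sketch correlates the first principal component with the covariate, which answers the same question numerically.

```r
set.seed(1)
n_genes <- 200
pct   <- c(35, 10, 28, 40, 22, 2, 5, 3, 8, 6)   # invented percentages
group <- factor(rep(c("A", "B"), each = 5))

# Simulated expected counts: the first 50 genes scale with the percentage,
# the remaining genes are flat.
mu <- matrix(100, nrow = n_genes, ncol = length(pct))
mu[1:50, ] <- sweep(mu[1:50, ], 2, 1 + pct / 10, "*")
counts <- matrix(rpois(length(mu), mu), nrow = n_genes)

# Simple log transform as a rough stand-in for a variance-stabilising transform.
log_counts <- log2(counts + 1)

# prcomp() expects samples in rows, so transpose the gene-by-sample matrix.
pca <- prcomp(t(log_counts))

# How strongly does the first principal component track the covariate?
pc1_cor <- cor(pca$x[, 1], pct)
print(data.frame(group = group, pct = pct, PC1 = round(pca$x[, 1], 2)))
cat("correlation of PC1 with pct:", round(pc1_cor, 2), "\n")
```

A scatter of the first two components, with points coloured by percentage and shaped by group, would show the same pattern visually.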

written 2.9 years ago by Devon Ryan (82k)
2.9 years ago, andrew.j.skelton73 (5.2k) wrote:

Provided you have a good number of observations, what you're asking is: "Are there any correlations between normalised counts and my continuous variable (based on your design formula), regardless of Group?" If that's right, then first things first: make sure that the continuous variable in your design matrix is numeric. You should also consider transforming the continuous variable, depending on the distribution of its values; a log transform, perhaps? You're most likely getting a rank error because your continuous variable is a factor in your design matrix. If that's not the case, then there are deeper issues with your underlying experimental design.
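
The numeric-versus-factor point can be checked in base R before anything is handed to DESeq2. The sketch below assumes a hypothetical sample sheet in which the percentages were read in as text and turned into a factor; the values and the column names `pct`, `pct_num` and `pct_log2` are illustrative only.

```r
# Hypothetical sample sheet where the percentages arrived as text
# and were converted to a factor on import.
coldata <- data.frame(
  group = factor(rep(c("A", "B"), each = 5)),
  pct   = c("35", "10", "28", "40", "22", "2", "5", "3", "8", "6"),
  stringsAsFactors = TRUE
)
cat("pct is numeric:", is.numeric(coldata$pct), "\n")

# As a factor, each distinct percentage gets its own column:
# intercept + 9 level columns + groupB = 11 columns for only 10 samples.
mm_factor <- model.matrix(~ pct + group, data = coldata)
cat("factor design: rank", qr(mm_factor)$rank,
    "with", ncol(mm_factor), "columns\n")

# Convert via character first; as.numeric() directly on a factor
# would return the internal level codes, not the percentages.
coldata$pct_num <- as.numeric(as.character(coldata$pct))
mm_numeric <- model.matrix(~ pct_num + group, data = coldata)
cat("numeric design: rank", qr(mm_numeric)$rank,
    "with", ncol(mm_numeric), "columns\n")

# Optional log transform of the covariate, as suggested above.
coldata$pct_log2 <- log2(coldata$pct_num)
```

With the factor encoding the design has more columns than samples, so it cannot be full rank; with the numeric encoding it has three columns and full rank.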


Sorry if my question was not well formulated, but I think you have misunderstood what I meant.

I don't want to know whether there is any correlation with the continuous variable regardless of the groups. I want to normalize the count matrix in such a way that this variable number of cells doesn't affect my comparison of the groups.

My main goal is to extract the differentially expressed genes between the groups, knowing that I have a starting bias in my samples because they are not pure populations. I thought that putting the percentage of cells in my design would do it.

Does that make more sense?

In any case, thank you for your answer; I will go back to my design and check whether my values are encoded as numeric rather than as factors.
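
The goal described in this reply, a group comparison adjusted for the percentage of cells, can be illustrated for a single gene with an ordinary linear model. This is only a sketch under stated assumptions: the data are simulated, the percentages are invented, and `lm()` on log-expression is a stand-in for the negative binomial GLM that DESeq2 actually fits. The term order mirrors the design `~ continuous_value + group` from the question.

```r
set.seed(42)
group <- factor(rep(c("A", "B"), each = 5))
pct   <- c(35, 10, 28, 40, 22, 2, 5, 3, 8, 6)   # invented percentages

# Simulated log-expression of one gene that depends on the percentage only;
# by construction there is no true group effect.
log_expr <- 5 + 0.05 * pct + rnorm(length(pct), sd = 0.1)

# Model without the covariate: the group term absorbs the percentage effect.
naive <- lm(log_expr ~ group)

# Model with the covariate first and the group term last.
adjusted <- lm(log_expr ~ pct + group)

naive_effect    <- unname(coef(naive)["groupB"])
adjusted_effect <- unname(coef(adjusted)["groupB"])
cat("group effect ignoring pct:", round(naive_effect, 2), "\n")
cat("group effect adjusted for pct:", round(adjusted_effect, 2), "\n")
```

In DESeq2 the analogous step would be to fit the design with the covariate included and then extract the group comparison, for example with `results(dds, contrast = c("group", "B", "A"))`. Note that when the percentage differs strongly between groups, as in this simulation, the two terms are partly confounded and the adjusted group estimate becomes less precise.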


written 2.9 years ago by VHahaut (1.1k)