Question

Use DESeqDataSetFromMatrix from a matrix of normalized read counts

0

Entering edit mode

4.0 years ago

celia.escher ▴ 20

I got my differential gene expression tables from RNAseq data already processed by another person and would like to perform predictive model analysis from a list of genes. I was advised before to use my normalized read count matrix (columns=samples, rows=genes) for rlog or VST transformations before building the model. However, I do not manage to use the DESeqDataSetFromMatrix function due to invalid design before rlog or VST.

I have 4 groups of samples:

1) healthy controls: no disease, no treatments

2) active untreated samples: active disease, no treatments

3) inactive treated samples: inactive disease, treated

4) active refractory samples: active disease, treatment was not successful and they stopped this treatment

I would like to obtain 2 different predictive models: one for genes that could predict the response (or not) to treatment (prediction between groups 3 and 4), and the other one for genes that could predict the relapse of symptoms or, in another way, what is the treatment not treating (prediction between groups 1 and 3).

Thus, I am trying to process all data at once with DESeqDataSetFromMatrix to subsequently apply rlog or VST. I wrote a design table in different ways but always gives me back the error in checkFullRank(modelMatrix) :

the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')

However, I do not fully understand the information there and this is why I am asking for support. This is the code I wrote for the design as if I had 2 samples per group. I am not sure whether suppressing Disease would be correct....

my_design <- data.frame(row.names = colnames(test),
                      Disease = c(rep("no", 2), rep("yes", 6)),
                      Activity = c(rep("no", 2), rep("yes", 2), rep("no", 2), rep("yes", 2)),
                      Treatment = c(rep("no", 4), rep("yes", 2), rep("no", 2)))

my_cds <- DESeqDataSetFromMatrix(test, my_design, ~ Disease + Activity + Treatment)

R RNA-Seq • 1.6k views

ADD COMMENT • link 4.0 years ago by celia.escher ▴ 20

score 0 · Answer 1 · 2020-04-23

0

Entering edit mode

4.0 years ago

ATpoint 81k

You cannot use these functions as they assume raw counts.

ADD COMMENT • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

But I was advised to perform it following this link A: Building a predictive model by a list of genes where it says that "With DESeq2, transform your normalised counts via the vst or rlog transformations, and then use the resulting data for the model building."

In any case, I also have the raw counts in the same data file.

Then, with normalized counts I could skip that step I guess... I am pretty lost since I have never tried these things before.

ADD REPLY • link 4.0 years ago by celia.escher ▴ 20

1

Entering edit mode

I edited the text in my other answer to make it more clear that DESeq2 takes raw counts.

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

What Kevin Blighe describes there is what the functions do internally. They take raw counts, normalize them with the RLE method from DESeq2 and then feed this into the actual rlog/vst procedure. You can of course try and disect the source code to feed in whatever counts you have but this is only recommended if you really know what you do. I am not sure how these functions work internally so if they directly model the normalized counts or use the size factors as offsets for any kind of model, still the default vst and rlog you need raw counts as said above. If you already have normalized counts why not using the log2+1 transformed normalized counts for your model? This is pretty common.

ADD REPLY • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

Indeed, if your data in test is already normalised (how was it normalised?), then you could do a simple transformation via log2(x + 1) (not ideal), and then use that for modeling, thus avoiding the DESeq2 stage. Please tell us more about how the data in test was produced.

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

I called test to a gene expression data frame (samples in columns and genes in rows) of rounded normalized counts obtained after running DEG analysis by a bioinformatician using DESeq2.

I have used log2(x+1) transformation from these rounded normalized counts to plot data (e.g. heatmaps), but I am also reconsidering whether I should have used rlog- or VST-transformed values for that.

Thank you two for your help!

ADD REPLY • link 4.0 years ago by celia.escher ▴ 20

0

Entering edit mode

Oh, it may be better to simply request that the bioinformatician send you the rlog or vst expression levels, and then use those. It should only take her/him a few minutes to do it.