Question: DESeq2 for differential gene expression on GTEx dataset
gravatar for vikram
3 months ago by
vikram0 wrote:


I'm new to the field, and I'm trying to do a differential gene expression on the GTex dataset. My aim is to identify sets of genes which (with some confidence) identify each of the 50 odd tissue types in the said dataset. The dataset is (bulk) RNA-seq ~50k genes and ~12k samples. The resource I have at hand has ~50 CPU, each with 12 cores and plenty of RAM. I have 1) browsed through the DESeq2 vignettes and I feel it may be a good fit. 2) Removed housekeeping genes, in the hope that it makes the task of the software a little easier. 3) Put the code to run

I was wondering if 1) My choice of algorithm is advisable, and 2) anyone has an estimate of how much time it may take the code to run

I'd be glad to give more details, if you need it.

Thanks for reading through. :-)

ADD COMMENTlink modified 3 months ago • written 3 months ago by vikram0

Removal of house keeping genes violates the assumptions of DESeq2, as it assumes that "most genes do not change". More specifically, DESeq2 assumes that "the median ratio will capture the size relationship" (original quote from Michael Love) between the samples, as this is the basis for the geomatric mean normalization. Intuitively I would therefore assume that whatever the result of the analysis will be, it will not be accurate. You might reconsider this step.

ADD REPLYlink modified 3 months ago • written 3 months ago by ATPoint2.9k

Thanks for the reply! The only reason I removed HK genes was because the matrix seemed to be too large. (50k x 12k). It's taken several hours, and it still hasn't completed.

ADD REPLYlink written 3 months ago by vikram0

On which step is it getting 'stuck'? If you are trying the regularised log transformation (rld()), then it will take a very long time and, for large datasets, the variance stabilising transformation (vst()) is recommended.

As ATPoint says, you should not remove housekeeping genes. Most housekeeping genes don't even behave as we think the behave (i.e., their expression does alter very much between different tissues and different stages of cell activity).

ADD REPLYlink written 3 months ago by Kevin Blighe12k

So, I am doing a DESeqDataSetFromMatrix(), followed by DESeq(), along the lines of this tutorial

The output that I can see thus far is:

estimating size factors

estimating dispersions

gene-wise dispersion estimates

I may be mistaken here, but I think DESeq2 requires raw values for differential expression.

I also tried running DESeq, where it is stuck in the nbinomTest() stage.

ADD REPLYlink modified 3 months ago • written 3 months ago by vikram0

Yes, you should be supplying raw counts to DESeq2, not FPKM/RPKM, or logged data. It should be raw integer counts.

It is quite common for it to take a while at that step, though. I'm not worried about the ~50,000 genes but more the ~12,000 samples! - that will take a very long time to process. You should just leave it overnight, even with 12 CPU cores. Just ensure that parallel is TRUE in the DESeq() function when you execute it?

On my laptop (4 CPU cores), running that part of the process for a dataset of ~30,000 genes and 550 samples takes ~4 hours. I will hold you in high esteem if you manage to normalise 12,000 samples!


ADD REPLYlink written 3 months ago by Kevin Blighe12k

I guess I'll wait. I have set parallel = TRUE, but it seems like the net CPU utilisation does not change by much, and there seems to be too much swapping because memory utlisation is close to 100%.

ADD REPLYlink written 3 months ago by vikram0

As your case is rather specific, I would ask for advice in the DESeq2 forum at Bioconductor. Michael Love is outstandingly responsive and is for sure the best person to ask for speed optimization.

ADD REPLYlink written 3 months ago by ATPoint2.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 782 users visited in the last hour