DESeq2 normalization without the house keeping genes
1
2
Entering edit mode
4.7 years ago

Hi All,

I have been using DESeq2 to perform differential gene expression analysis between two groups but I do not have house keeping genes in the data set. Is it still possible to do DESeq calculations without the house keeping genes? and I do not have all the genes expressed across all subjects/samples.

I have the following error message calculating size factors:

dds <- DESeq(dds)

estimating size factors

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :

every gene contains at least one zero, cannot compute log geometric means

Any help would be appreciated.

Thank you.

Gajender

DESeq2 RNA-Seq R • 3.2k views
4
Entering edit mode

DESeq2, by default, does not use or need housekeeping genes.

2
Entering edit mode

every gene contains at least one zero, cannot compute log geometric means

Bizarrely, I have come across this error before, but it's very rare. It occurs literally when each of your transcripts contains at least 1 zero. To detect these in a dataset, I wrote the following complex one-liner in R:

First make sure that you remove transcripts that just don't have any counts:

#Remove transcripts of 0 counts across all samples
ZeroCountFilterIndices <- which(apply(MyData, 1, mean)==0)
print(paste("Total transcripts with 0 counts (all samples):", length(ZeroCountFilterIndices), sep=" "))
if (length(ZeroCountFilterIndices)>0)
{
MyData <- MyData[-ZeroCountFilterIndices,]
}


Then:

#Check if all genes have at least 1 zero (generates error with DESeq2)
if (!any(data.frame(unlist(apply((MyData==0), 1, function(x) table(x))))==ncol(MyData)))
{
print("Cannot proceed to DESeq2 as all genes or transcripts have at least 1 zero")
... other function to quit / stall the process ...
}


Explanation (the logic can be tough):

[1], First convert the entire data-frame to TRUE or FALSE for 0 or non-zero

MyData==0


[2], It then applies the table function per row, which gives TRUE and FALSE tallies per gene

apply( [1] , 1, function(x) table(x))


[3], It then checks if any tally equals the total number of samples - as we've already eliminated genes with all 0 values, the only condition that can meet ncol(MyData) is the FALSE (non-zero) condition. This produces a further TRUE or FALSE for each gene and condition.

data.frame(unlist( [2] ))==ncol(MyData)


[4], Finally, if any TRUE values are present, then we know that at least 1 row has all non-zeros, and therefore we can proceed

if (!any( [3] ))

1
Entering edit mode

I wrote the following complex one-liner in R

I'd say that's a little long for a one-liner ;-)

0
Entering edit mode

I have a 2 liner, if it is of interest. Lets say df is your counts dataframe and your entire expressed gene across samples. Remove based on rowSums and filter.

dim(df)
#[1] 22082 22
keep <- rowSums(df)>0
df.counts <- df[keep,]
dim(df)
#[1] 20382 22

0
Entering edit mode

Just because you can proceed with only a handful of genes with no zero values does not mean you should. The poster needs to know if the problem is a few dropped samples, or a whole bunch. There might be simpler ways to assess the QC of the fastq or the alignment. Even eyeballing the size of the fastq files might tell the poster what they need to know.

0
Entering edit mode

I agree to that but OP need to clarify if the QC reports of fastq was problematic or not or let's say the alignment. Problems can be a lot but if the OP is starting with exploratory analysis on counts then filtering is not to be ruled out. Only if it reduces the expression to a lot then it is big reason to worry else not. This is fine provided OP clarifies the FASTQC and alignment reports. Since it was based on DESeq2 and genes expressed so we suggested such

0
Entering edit mode

Thank you very much for sharing the script. That's a great help.

Another question - is it possible that some genes are turned on exclusively in healthy but not in diseased subjects and vice-versa. Is there a way to look for such genes as well?

1
Entering edit mode

Yes, I have seen that in breast cancer where, for example, a non-coding RNA can show very high expression in cancer but no (or virtually nil) expression in healthy controls. These would appear through differential expression analysis and have very low statistically significant adjusted P values and high absolute log [base 2] fold changes.

0
Entering edit mode

For sure, I see that! Thank you for answering my question.

1
Entering edit mode

Great - have a nice day / evening

0
Entering edit mode

It is a scenario which happens for non-coding RNAs but also true for protein-coding mRNA transcripts as well if the library is not done well or library enrichment is not in accordance with the kind of regions one is looking for. Another problem is are you quantifying across whole genome or only transcriptome? Likely scenario is for the whole genome, you will have a lot of genes/transcripts that will not have any expression across most samples. It is also seen, at least I have seen when I quantify for the only transcriptome and still, there are dropouts for 5-8% of genes across all samples which I filter out. That can be not only biological but also technological and technical dropouts.

4
Entering edit mode
4.7 years ago

DESeq2 by default doesn't use housekeeping genes. It makes the assumption that most genes are not diferentially expressed, and corrects based on the gene with the median expression

But your problem is exactly as it says; the default normalization method calculates the geometric mean of each gene, that is not possible when a gene has a sample with zero counts. Some genes with some zeroes are fine, the normalization function just ignores them, but in your data, every single gene is being excluded for that reason. The default method of normalization will not work with your dataset as is. If you have a few samples with very few reads, omitting them might fix the problem.

2
Entering edit mode

Point is did OP perform any filtering on counts distribution? There has to be the filtering and that one does after plotting the density plot. Post that you should do PCA bi plot or MDS of un filtered and filtered counts followed by same MDS or PCA on normalized data. Your expression set had dropout genes. So discard them based on row sums threshold. I cannot tell how much the threshold should be unless I see the distribution plot of genes total counts across all samples.

0
Entering edit mode

works fine when I remove samples that have very low counts. Thanks for your suggestions.