Question: DESeq2 normalization without the house keeping genes
0
gravatar for gajender.aleti
15 months ago by
San Diego
gajender.aleti0 wrote:

Hi All,

I have been using DESeq2 to perform differential gene expression analysis between two groups but I do not have house keeping genes in the data set. Is it still possible to do DESeq calculations without the house keeping genes? and I do not have all the genes expressed across all subjects/samples.

I have the following error message calculating size factors:

dds <- DESeq(dds)

estimating size factors

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :

every gene contains at least one zero, cannot compute log geometric means

Any help would be appreciated.

Thank you.

Gajender

rna-seq deseq2 R • 977 views
ADD COMMENTlink modified 15 months ago • written 15 months ago by gajender.aleti0
4

DESeq2, by default, does not use or need housekeeping genes.

ADD REPLYlink written 15 months ago by WouterDeCoster37k
2

every gene contains at least one zero, cannot compute log geometric means

Bizarrely, I have come across this error before, but it's very rare. It occurs literally when each of your transcripts contains at least 1 zero. To detect these in a dataset, I wrote the following complex one-liner in R:

First make sure that you remove transcripts that just don't have any counts:

#Remove transcripts of 0 counts across all samples
ZeroCountFilterIndices <- which(apply(MyData, 1, mean)==0)
print(paste("Total transcripts with 0 counts (all samples):", length(ZeroCountFilterIndices), sep=" "))
if (length(ZeroCountFilterIndices)>0)
{
    MyData <- MyData[-ZeroCountFilterIndices,]
}

Then:

#Check if all genes have at least 1 zero (generates error with DESeq2)
if (!any(data.frame(unlist(apply((MyData==0), 1, function(x) table(x))))==ncol(MyData)))
{
    print("Cannot proceed to DESeq2 as all genes or transcripts have at least 1 zero")
    ... other function to quit / stall the process ...
}

Explanation (the logic can be tough):

[1], First convert the entire data-frame to TRUE or FALSE for 0 or non-zero

MyData==0

[2], It then applies the table function per row, which gives TRUE and FALSE tallies per gene

apply( [1] , 1, function(x) table(x))

[3], It then checks if any tally equals the total number of samples - as we've already eliminated genes with all 0 values, the only condition that can meet ncol(MyData) is the FALSE (non-zero) condition. This produces a further TRUE or FALSE for each gene and condition.

data.frame(unlist( [2] ))==ncol(MyData)

[4], Finally, if any TRUE values are present, then we know that at least 1 row has all non-zeros, and therefore we can proceed

if (!any( [3] ))
ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe39k
1

I wrote the following complex one-liner in R

I'd say that's a little long for a one-liner ;-)

ADD REPLYlink modified 15 months ago • written 15 months ago by WouterDeCoster37k

I have a 2 liner, if it is of interest. Lets say df is your counts dataframe and your entire expressed gene across samples. Remove based on rowSums and filter.

dim(df)
#[1] 22082 22
keep <- rowSums(df)>0
df.counts <- df[keep,]
dim(df)
#[1] 20382 22
ADD REPLYlink written 15 months ago by ivivek_ngs4.7k

Just because you can proceed with only a handful of genes with no zero values does not mean you should. The poster needs to know if the problem is a few dropped samples, or a whole bunch. There might be simpler ways to assess the QC of the fastq or the alignment. Even eyeballing the size of the fastq files might tell the poster what they need to know.

ADD REPLYlink written 15 months ago by swbarnes25.0k

I agree to that but OP need to clarify if the QC reports of fastq was problematic or not or let's say the alignment. Problems can be a lot but if the OP is starting with exploratory analysis on counts then filtering is not to be ruled out. Only if it reduces the expression to a lot then it is big reason to worry else not. This is fine provided OP clarifies the FASTQC and alignment reports. Since it was based on DESeq2 and genes expressed so we suggested such

ADD REPLYlink written 15 months ago by ivivek_ngs4.7k

Thank you very much for sharing the script. That's a great help.

Another question - is it possible that some genes are turned on exclusively in healthy but not in diseased subjects and vice-versa. Is there a way to look for such genes as well?

ADD REPLYlink written 15 months ago by gajender.aleti0
1

Yes, I have seen that in breast cancer where, for example, a non-coding RNA can show very high expression in cancer but no (or virtually nil) expression in healthy controls. These would appear through differential expression analysis and have very low statistically significant adjusted P values and high absolute log [base 2] fold changes.

ADD REPLYlink modified 15 months ago • written 15 months ago by Kevin Blighe39k

For sure, I see that! Thank you for answering my question.

ADD REPLYlink modified 15 months ago • written 15 months ago by gajender.aleti0
1

Great - have a nice day / evening

ADD REPLYlink written 15 months ago by Kevin Blighe39k

It is a scenario which happens for non-coding RNAs but also true for protein-coding mRNA transcripts as well if the library is not done well or library enrichment is not in accordance with the kind of regions one is looking for. Another problem is are you quantifying across whole genome or only transcriptome? Likely scenario is for the whole genome, you will have a lot of genes/transcripts that will not have any expression across most samples. It is also seen, at least I have seen when I quantify for the only transcriptome and still, there are dropouts for 5-8% of genes across all samples which I filter out. That can be not only biological but also technological and technical dropouts.

ADD REPLYlink written 15 months ago by ivivek_ngs4.7k
4
gravatar for swbarnes2
15 months ago by
swbarnes25.0k
United States
swbarnes25.0k wrote:

DESeq2 by default doesn't use housekeeping genes. It makes the assumption that most genes are not diferentially expressed, and corrects based on the gene with the median expression

But your problem is exactly as it says; the default normalization method calculates the geometric mean of each gene, that is not possible when a gene has a sample with zero counts. Some genes with some zeroes are fine, the normalization function just ignores them, but in your data, every single gene is being excluded for that reason. The default method of normalization will not work with your dataset as is. If you have a few samples with very few reads, omitting them might fix the problem.

ADD COMMENTlink modified 15 months ago • written 15 months ago by swbarnes25.0k
2

Point is did OP perform any filtering on counts distribution? There has to be the filtering and that one does after plotting the density plot. Post that you should do PCA bi plot or MDS of un filtered and filtered counts followed by same MDS or PCA on normalized data. Your expression set had dropout genes. So discard them based on row sums threshold. I cannot tell how much the threshold should be unless I see the distribution plot of genes total counts across all samples.

ADD REPLYlink written 15 months ago by ivivek_ngs4.7k

works fine when I remove samples that have very low counts. Thanks for your suggestions.

ADD REPLYlink modified 15 months ago • written 15 months ago by gajender.aleti0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 811 users visited in the last hour