RNA-seq normalization using housekeeping genes in EdgeR
Entering edit mode
4.5 years ago
k.kole ▴ 10


I'm quite new to RNA-seq data analysis and I was wondering if someone could help me out with some normalization. I have 4 biological replicates per sample and would like to normalize the data using housekeeping genes, among other reasons because it is known that the tissues where the RNA comes from contain different amounts of cells hence I would like to correct for this. The question with housekeeping genes always is of course if they are actually stable, but in our case there is data publicly available that can help us select stably expressed genes.

I'm using edgeR for my differential gene expression analysis. I know it has a function that is designed for normalization using HKGs but I'm not sure if this also works in the case where one does have replicates. As far as I understand it, in this way edgeR take 1 experimental group, then calculates the normalization factor it needs to use based on the housekeeping gene expression, and finally applies this to all groups. Is this correct? And should this method be used if one does have biological replicates?

Besides the above, I haven't been able to get it to work quite yet. If I follow the edgeR manual, I end up with all my p-values and FDRs being 0, which is a bit too good to be true. Also I can't seem to use multiple HKGs, even though it is of course sensible to use many of them. Below is the error message I get when I try to use two or more HKGs in the command, although using either one seperately does let me continue (although the resulting p-values/FDRs are not realistic).

> y0 <- estimateCommonDisp(y1["Hkg1","Hkg2"])
Error in subsetListOfArrays(object, i, j, IJ = IJ, IX = IX, I = I, JX = JX) : 
  Subscript not found in colnames

Can anyone tell me what I'm doing wrong? And, perhaps more importantly, is this the right method to go about the normalization, or are there better alternatives that I could try? Many thanks in advance for your input.

===edit=== The code after assigning samples to groups: After assigning samples to groups:

> keep <- rowSums(cpm(y)>1) >= 4
> y <- y[keep, , keep.lib.sizes=FALSE]
> y <- calcNormFactors(y)
> y1 <- y
> y1$samples$group <- 1
> y0 <- estimateCommonDisp(y1["Hkg1",])
> y$common.dispersion <- y0$common.dispersion
> de <- exactTest(y, pair = c("Group1", "Group2"))

Actually I found that if I use other HKGs there actually are more believable p and FDR-values, so it seems to depend on which gene I use for normalization.

RNA-Seq R rna-seq normalization housekeeping genes • 4.2k views
Entering edit mode

What's your experiment design ? Also please post entire code from normalisation to DE calling. You must have something wrong in your code to have all genes FDR = 0.

Entering edit mode

We're looking at different brain regions that either did or did not receive a certain treatment, and we're comparing between them. The different brain regions all come from the same animals (so region A and B can be collected from the same brain). I've edited the original post to contain the code you requested.

Entering edit mode
4.5 years ago
GZ1995 ▴ 400
  1. It is not correct that edgeR will use HK genes for normalization. The default calling of calcNormFactorsuses TMM, with all of genes and replicates used for normalization.

  2. You did not follow edgeR workflow. You should use all of genes for both common and gene-wise dispersion estimation. You did not estimate gene-wise dispersion. And it is now recommended by edgeR author to use GLM version of edgeR. Please read edgeRUsersGuide() or this workflow carefully.

  3. If you want to use HK genes and replicates for normalization, I would recommend RUVSeq. Here is a paper and detailed tutor on how to do this for beginners.


Login before adding your answer.

Traffic: 1834 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6