Question: Normalization Procedure For The Rna-Seq Data Based On Counts
gravatar for mssestak
7.0 years ago by
Zagreb, Croatia
mssestak20 wrote:


I have a RNA-Seq time-series data (vertebrate development with 9 stages and 2 samples per stage). So far I mapped these reads to the genome using BWA and used these alignments to count the number of reads on the level of genes using HTSeq count since I'm not interested in exons but complete genes. I'm using these counts to first calculate frequencies of expression value for each gene, (e.g., count of particular gene/sum of all counts in a sample), and then to calculate sum of frequencies for each sample and stage. As a result of this procedure I get the cumulative values (cumulative frequency) that show how the total expression varies between the stages. The question: I want to try different normalization techniques (on the HTSeq count results) to see how they affect my cumulative expression values. How to do only the normalization(s) without the differential expression of genes that usually comes later in packages?. What are my options among the packages available?



htseq normalization rna-seq • 14k views
ADD COMMENTlink modified 7.0 years ago by John St. John1.2k • written 7.0 years ago by mssestak20
gravatar for Mikael Huss
7.0 years ago by
Mikael Huss4.7k
Mikael Huss4.7k wrote:

You could use something like this for the voom() normalization/variance stabilization in limma:

normalize.voom <- function(counts){

and something like this for the TMM normalization in edgeR (this function returns CPM, count per million, values as calculated by edgeR)

cpm.tmm <- function(counts, groups=NA){
        d<-DGEList(counts=counts, group=groups)
    d <- calcNormFactors(d, method="TMM") 
    return(cpm(d, normalized.lib.sizes=TRUE))

Of course there are other options as well such as RLE normalization (edgeR/DESeq) and DESeq's variance stabilization.

ADD COMMENTlink written 7.0 years ago by Mikael Huss4.7k

The code that you post for using voom simply computes logCPM. If that's all you want, there's no need to run voom to get it. The benefit of running voom is to compute the weights on each observation, which you are throwing away by extracting just the $E from the resulting EList object.

ADD REPLYlink written 7.0 years ago by Ryan Thompson3.4k

Yes, good point. The way I interpreted the question, the person who asked just wanted to look at expression distributions resulting from using different normalizations (like logCPM) without doing differential expression testing. But that could equally as well have been done by the cpm() in edgeR without TMM, for instance. So I guess the voom() snippet was superfluous.

ADD REPLYlink written 7.0 years ago by Mikael Huss4.7k

Is there a way to have a matrix with the data corrected for variance as in voom ?

I would like to plot my samples before running DE ...

Thanks for your help...

ADD REPLYlink written 5.5 years ago by bnj1950

As mentioned above, voom calculates a log(count per million) but also provides a confidence weight on each observation. So with the code given above, you will in a sense get the "data corrected for variance" (while losing the confidence weights as Ryan commented) - or you can just divide the counts by million reads and take logarithms yourself. Or you could use DESeq2's rlogTransformation() or varianceStabilizingTransformation() instead.

ADD REPLYlink written 5.5 years ago by Mikael Huss4.7k

Can you use rlog function in R to normalize HT-seq counts? The issue is that I have loaded my data matrix consisting of counts generated from HTseq for RNA-Seq time-series data. There are sort of 2 replicates. First replicate has 4 time point samples(day0,2,5,14), while the Second has 5 time point samples(day0,2,5,15,30). The data was loaded as a read.table matrix in R. The following is the script used:

df_A <- read.table(file='Replicate_sample_Raw_RPM_counts-2',header=T,sep='\t')
df_B <- read.table(file='HT-seq_counts.txt-2',header=T,sep='\t')
merged_df <- merge(df_A,df_B,by='geneID')


  day0.x day2.x day5.x day14 day0.y day2.y day5.y day15 day30
1    358    422    241   617    145    508    389   357   594
2     11     31     44    26      8     24     41    49    49
3      7      3     33   392      2      5     25   159   155
4     26     45     74  5624     45    175     94  4604 14238
5      4     10     66   338     19     13     70   229   242
6    477    138     64    21    747    507     98    25    22

But when I try to to run rlog or rlogTransformation I get the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’

How can I improve the script to make rlog run so that I can have my HT-seq counts normalized?

ADD REPLYlink written 4.1 years ago by herman.pappoe.4510

Hi Mikael , I am using R code for TMM normalization, but am getting this error

  cpm.tmm <- function(CB_total, groups=NA) {
   + require(edgeR)
   + (groups) {
   Error: unexpected '{' in:
   "require(edgeR) (groups) {"

I am new to R . So can you give me your R code for TMM normalization with explaination (if you can). Thanks.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by k.kathirvel93250
gravatar for John St. John
7.0 years ago by
John St. John1.2k
San Francisco, CA, Cancer Therapeutics Innovation Group
John St. John1.2k wrote:

You could always load the data into an R matrix and then apply whatever normalization technique you want (eg mean, upper quartile, median, whatever else).

ADD COMMENTlink written 7.0 years ago by John St. John1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2091 users visited in the last hour