Question: Proper normalization method
2
gravatar for Cytosine
4.4 years ago by
Cytosine440
Ljubljana, Slovenia
Cytosine440 wrote:

Hey everybody,

I'm getting my hands dirty with comparing count data from a couple of RNA-Seq samples. The goal is pretty simple... Normalize the data and compare the expression levels (not looking for DE genes, just expression levels).

I have X total counts in sample 1 and Y total counts in sample 2. They are stored in a bed file of counts per 100 bp windows.

Is it enough to divide the amount of counts in each row of the sample bed files by the total amount of counts from that sample (e.g. for i in rows_in_file: X[i]/X and Y[i]/Y)? Does this make the obtained values comparable between samples?

Consider also that I do not have information on gene structure (e.g. length for RPKM/FPKM).

Would also appreciate hints on other suitable methods for this task. (statistics not being my strong side... yet :) )

Thanks!

ADD COMMENTlink modified 4.4 years ago by Asaf5.5k • written 4.4 years ago by Cytosine440

I guess it depends on exactly what you mean by "compare the expression levels". Do you mean between genes while accounting for sample differences?

ADD REPLYlink written 4.4 years ago by Devon Ryan89k

Yes, that is essentially my goal.

ADD REPLYlink written 4.4 years ago by Cytosine440
3
gravatar for Asaf
4.4 years ago by
Asaf5.5k
Israel
Asaf5.5k wrote:

"(not looking for DE genes, just expression levels)"

Actually, getting DE genes is much easier than getting expression levels. The relation between counts per gene and the actual amount of mRNAs in the sample is not trivial. When you compare the counts per gene between two samples you assume that the change you see in the counts represents the change in mRNAs because the biases should be the same. Try taking the same example and preparing RNA-seq libraries using two different protocols (we did it), you'll see that some genes (or a lot of genes) will have different counts in the two libraries (and this difference will be reproducible) this means that counts doesn't truly represents the mRNAs.

EDIT:

Following the comments I'll add some more thoughts. I came across a similar situation, luckily for me the total number of reads in the two libraries was about the same so I could just compare the raw counts between the libraries. If the two libraries look very much the same (you can plot the base-by-base correlation) then dividing each count in the total (mapped) number of counts in the library is not a bad idea.

If the libraries are different (below 0.9 or 0.85 Spearman correlation) I wouldn't divide in the total number of reads and would prefer the DESeq normalization method. In a nutshell, you assume that most of the positions in the genome have the same expression levels in the two libraries. For each position you compute Xi/Yi and take the median of these quotients as your normalization factor (multiply Yj by the median for each j in the genome). You are now able to compare Xj with the normalized Yj.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Asaf5.5k

Well my goal is to compare expression levels between samples because I do not have structural information about genes. I do agree that the counts I get from a pileup do not represent the actual amount of mRNA in the sample, but for the sake of comparing between samples prepared with the same protocol, the similar systematic bias present in both samples and introduced by the RNA extraction protocol should be negated with between-samples normalization of their counts.

The actual levels of mRNA in the sample do not interest me and I believe to get them, a wet-lab approach would be more suitable than an in-silico one.

ADD REPLYlink written 4.4 years ago by Cytosine440

Can't you predict transcripts from the RNA-seq data?

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Asaf5.5k

I can, but that's not what I'm trying to achieve here.

ADD REPLYlink written 4.4 years ago by Cytosine440

I think I understand what you're trying to do now, I'll edit my answer.

ADD REPLYlink written 4.4 years ago by Asaf5.5k

Thank you for this one! My libraries differ quite a lot, so comparison of raw counts won't cut it for me. I'll try out the normalization method you mentioned. Just a detail though... Do you mean to multiply both Yj and Xj by the median for each j in the genome or just Yj?

ADD REPLYlink written 4.4 years ago by Cytosine440

The complete method is described in the DESeq manual/paper (for genes rather than positions but it should work as well). You should multiply only Yj in the median quotient so it will comparable to Xj (write it down, the equation will make sense)

ADD REPLYlink written 4.4 years ago by Asaf5.5k
2
gravatar for Devon Ryan
4.4 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

The problem with total count normalization is that it's sensitive to any change in highly-expressed species (e.g., a difference in rRNA depletion will cause incorrect normalization). You have three popular options to handle this type of data: (1) the DESeq/DESeq2 normalization method, RLE; (2) the edgeR normalization method, TMM; and (3) quantile normalization. In most cases, these produce similar results. In your case, I'd personally try quantile normalization followed by a paired T-test. You'll still want to have a look at the results along side the raw data and see if it makes any sense at all.

ADD COMMENTlink written 4.4 years ago by Devon Ryan89k

A step closer to getting in grips with biostatistics. Thanks Devon!

ADD REPLYlink modified 4.4 years ago by Devon Ryan89k • written 4.4 years ago by Cytosine440

Happy to help! After you get the results, please post a comment and let me know how well that actually works in practice.

ADD REPLYlink written 4.4 years ago by Devon Ryan89k
2

Following up on this topic... Devon, I tried your suggestion for a quantile normalization and the processed data seemed to act a bit strange (lot of zero values appearing). Then I also tried your and Asaf's suggestion about the DESeq normalization and it seemed to perform much better (very few something->nothing and vice-versa conversions), so I decided this is the one I'm going to use. Thumbs up to both of you for this suggestion!

ADD REPLYlink written 4.4 years ago by Cytosine440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1869 users visited in the last hour