Question

RNA-Seq Normalization On 2 Very Different States

12.8 years ago

Pasta ★ 1.3k

Hi,

We are working on bacteria that can be free-living (Condition #1) or live in a symbiotic fashion in plant cells. In the symbiotic form (Condition #2), the cell is almost dormant with only a few pathways working. Also, we expect some genes to be overexpressed in this condition.

We performed RNA-seq on these 2 conditions and we would like to compare them. We need to normalize the data, so I thought about using total read count normalization. We have 40 million reads for condition #1 and about the same for condition #2, but in condition #2 only 500,000 reads are specific to the bacteria, since our samples contained bacterial mRNA + plant mRNA.

Do you think that normalizing on total read count is still relevant?

Thanks

rna data • 4.1k views

updated 10.8 years ago by Biostar 20 • written 12.8 years ago by Pasta ★ 1.3k

score 1 · Answer 1 · 2011-08-12

Hi pasta

I would second the use of edgeR for your comparisons. The TMM normalisation method of Robinson and Oshlack (Mark D. Robinson and Alicia Oshlack, “A scaling normalization method for differential expression analysis of RNA-seq data,” Genome Biology 11, no. 3 (2010): R25) has been shown to work well in situations such as yours (one sample has many highly expressed genes that are not present in the other).

Furthermore, edgeR uses the count data directly and not an abstracted representation of counts such as RPKM.

Of course, to use the package you'll need some familiarity with R, but as far as packages go it's pretty easy and the documentation is very good. The paper above is a good jumping-off point.
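
To make the TMM idea a bit more concrete, here is a rough plain-Python sketch of a trimmed mean of M-values between one sample and a reference. It is a simplification for illustration only, not edgeR's code: the function name `tmm_factor`, the trimming fractions used as defaults, and the toy counts are all assumptions, and for real data you would use edgeR itself.

```python
import math

def tmm_factor(sample, ref, trim_m=0.3, trim_a=0.05):
    """Simplified TMM-style scaling factor of `sample` relative to `ref`.

    Illustrative sketch only; edgeR's implementation differs in details.
    """
    n_s, n_r = sum(sample), sum(ref)
    rows = []
    for y_s, y_r in zip(sample, ref):
        if y_s == 0 or y_r == 0:
            continue  # log ratios are undefined for zero counts
        p_s, p_r = y_s / n_s, y_r / n_r
        m = math.log2(p_s / p_r)          # log fold change (M-value)
        a = 0.5 * math.log2(p_s * p_r)    # average log abundance (A-value)
        # inverse of the approximate variance of M, used as a weight
        w = 1.0 / ((n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r))
        rows.append((m, a, w))

    n = len(rows)
    by_m = sorted(range(n), key=lambda i: rows[i][0])
    by_a = sorted(range(n), key=lambda i: rows[i][1])
    # drop the most extreme genes by M and by A, keep the intersection
    keep_m = set(by_m[int(n * trim_m): n - int(n * trim_m)])
    keep_a = set(by_a[int(n * trim_a): n - int(n * trim_a)])
    keep = keep_m & keep_a
    if not keep:
        raise ValueError("no genes left after trimming")

    num = sum(rows[i][2] * rows[i][0] for i in keep)
    den = sum(rows[i][2] for i in keep)
    return 2 ** (num / den)

# Toy data (made up): 18 unchanged genes, 2 massively induced in `sample`.
ref = [100] * 20
sample = [100] * 18 + [5000, 5000]

factor = tmm_factor(sample, ref)
print("raw library sizes:", sum(sample), sum(ref))
print("TMM-style factor:", factor)
print("effective library size of sample:", sum(sample) * factor)
```

In this toy example, scaling by the raw totals (11800 vs 2000) would make the 18 unchanged genes look about 5.9-fold down in the sample. The trimmed factor ignores the two induced genes, so the effective library size (total × factor) comes back to 2000 and the unchanged genes come out equal.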

best

iain

score 0 · Answer 2 · 2011-06-24

Hello,

It is even more relevant to normalize by read count since you have such dramatic differences between samples. Of course, when I mention read counts I mean reads belonging to the bacteria you study; you can (and have to) discard the reads belonging to the symbiotic plant (which can nonetheless still provide insights for some other analyses).

Then, on one side you have to consider the 40 million reads and on the other only 500,000, and normalize using these numbers. The main difference is that you will be less likely to detect lowly transcribed genes with a lower number of reads, and gene expression estimates may also be noisier (because of this technical bias). This is an aspect you should consider during your analyses.
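
To put rough numbers on the detection and noise point, here is a small Python sketch. It assumes read counts follow simple Poisson sampling, which is my assumption and ignores biological variability, and the helper names and the "1 transcript per million" abundance are hypothetical.

```python
import math

def expected_count(abundance_per_million, library_size):
    """Expected reads for a gene at a given abundance (reads per million)."""
    return abundance_per_million * library_size / 1e6

def detection_probability(lam):
    """Chance of seeing at least one read when the Poisson mean is lam."""
    return 1.0 - math.exp(-lam)

def poisson_cv(lam):
    """Coefficient of variation of a Poisson count with mean lam."""
    return 1.0 / math.sqrt(lam)

# Bacterial library sizes from the question: 40 million vs 500,000 reads.
for depth in (40_000_000, 500_000):
    lam = expected_count(1.0, depth)  # hypothetical gene at 1 read per million
    print(depth, lam, detection_probability(lam), poisson_cv(lam))
```

Under these assumptions, a gene at 1 read per million is expected to get 40 reads in the free-living library but only 0.5 in the symbiotic one, so it would be seen at all in only about 39% of such libraries, and its count would be far noisier.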

score 0 · Answer 3 · 2011-08-10

I'm not sure if you already have a normalization method, but I had a similar scenario with my sequencing project. Currently I'm doing RNA-seq on yeast, and the method we are using gives us lots of "contaminating" ribosomal RNA (rRNA). The great part about this contamination is that it vastly outnumbers the mRNA reads we get, thus providing a way to normalize between samples without losing any information about the number of mRNA reads we get.

To normalize the data we're using edgeR. This R package allows you to input your data and the total number of reads separately. So, for instance, if I have a matrix of genes and tag counts for each gene, it will have all the mRNAs but not the rRNA. When it asks for my total number of reads, however, I give it mRNA + rRNA. I believe DEGseq is another good normalization package in R too.
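
The effect of supplying the library size separately from the count matrix can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how edgeR is called: `cpm`, its `lib_size` argument, and all the counts below are made up.

```python
def cpm(counts, lib_size=None):
    """Counts per million; the library size defaults to the column sum."""
    if lib_size is None:
        lib_size = sum(counts)
    return [c * 1e6 / lib_size for c in counts]

# Hypothetical sample: the count matrix holds mRNA only, rRNA is tallied apart.
mrna_counts = [50, 150, 300]
rrna_reads = 9_500

total_reads = sum(mrna_counts) + rrna_reads  # mRNA + rRNA

print(cpm(mrna_counts))               # scaled by mRNA reads only
print(cpm(mrna_counts, total_reads))  # scaled by all sequenced reads
```

With the rRNA included in the library size, a sample whose mRNA content is globally lower keeps lower normalized values instead of being scaled back up, which is the information this approach is trying to preserve.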

Hope that helped