RNA Composition Bias in RNA-seq Analysis
8.7 years ago
narges ▴ 200

Hi,

Normalization is one of the most important preliminary steps in an RNA-seq analysis. One of the biases that normalization methods try to correct is RNA composition bias. I think I am misunderstanding this bias. Based on the edgeR manual, it comes from the fact that we measure the abundance of each gene relative to the total amount of RNA in our sample, not relative to the total amount of RNA in the cell, and this introduces a bias for highly expressed genes. I don't actually understand why this is so. If a gene is highly expressed in one sample but not in another, it should be reported as a DE gene. What is the bias here?

rna-seq • 8.4k views

There are many types of normalization in RNAseq and it's not entirely clear (to me at least) to which one you're referring. From context, I assume you're talking about library size normalization, but perhaps you mean GC content or transcript length normalization (these aren't standard in edgeR). Could you clarify to which you're referring?


No, I did not mean GC content. I think GC content does not affect the results very much, since it should be roughly the same across libraries. I meant the RNA composition bias that we use, for example, the calcNormFactors() function in edgeR to eliminate. I had expected that calcNormFactors() was there to correct for differences in sequencing depth, but now I do not understand what this "RNA composition" bias means.


Ah, you're referring to section 2.5.3 in the user guide, then. I'll create an answer below.


Yes, exactly, that is the part I was referring to.

8.7 years ago

The bias is less about hugely expressed genes per se than about cases where some genes are very highly expressed in group A but very lowly expressed in group B. If those are the only truly DE genes, then you need a normalization that not only takes library size into account, but also avoids the non-DE genes being called DE (in this case, as having lower expression in group A than in group B). A more concrete example will be much more informative. For the sake of simplicity, let's look at 2 hypothetical samples with exactly the same number of reads, where half the genes are identically expressed and the other half are specific to one group (N.B., this is just a reformulation of the example in the edgeR manuscript).

       group1   group2
gene1     500        0
gene2     500        0
gene3     500     1000
gene4     500     1000


Each of the groups has 2000 counts, but if you took only library size into account, then all of the genes would be called DE instead of just genes 1 and 2 (in a real example there would, of course, be many more genes like gene3 and gene4). A correct normalization method should account for this sort of case and should produce a scaling factor of around 2 for group1. Have a read through this section of the edgeR paper, which I find very helpful in thinking about this sort of thing.
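To make the arithmetic concrete, here is a quick sketch in Python (not edgeR's actual TMM implementation, just the underlying idea of taking a robust average of count ratios over genes seen in both samples), using the toy counts above:

```python
import statistics

# Toy counts from the table above: both libraries total 2000 reads.
group1 = {"gene1": 500, "gene2": 500, "gene3": 500, "gene4": 500}
group2 = {"gene1": 0,   "gene2": 0,   "gene3": 1000, "gene4": 1000}

# Naive library-size normalization: the library sizes are equal, so the
# factor is 1 and every gene looks DE (500 vs 0, 500 vs 1000).
lib_factor = sum(group1.values()) / sum(group2.values())
print(lib_factor)  # 1.0

# A composition-aware factor (in the spirit of TMM / median-of-ratios):
# use only genes observed in both samples and take the median count ratio.
ratios = [group2[g] / group1[g]
          for g in group1 if group1[g] > 0 and group2[g] > 0]
scale = statistics.median(ratios)
print(scale)  # 2.0 -> divide group2 by 2, and gene3/gene4 stop looking DE
```

With that factor applied, gene3 and gene4 come out at 500 vs 500 and only genes 1 and 2 remain DE, which is the desired behaviour.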


So it's basically sampling bias.


Sampling bias? Do you mean technical biases that arise when extracting RNA molecules from the cell?


It's a bit counter-intuitive, I guess. If we assume genes 3 and 4 should be expressed at exactly the same level in groups 1 and 2, then there is technically over-sampling happening in group 2. Think about it this way:

• Say group 1 and group 2 are two biological samples with equal numbers of cells, and within each sample every cell is identical in transcript composition.
• Genes 3 and 4 are equally expressed in the two biological samples.
• Genes 1, 2, 3, and 4 are equally expressed in the cells of group 1.
• Say we sampled 500 group 1 cells and ended up with 500 transcripts of each gene.
• To get 500 transcripts each of genes 3 and 4, we would sample 500 group 2 cells.
• To get 1,000 transcripts each of genes 3 and 4, we would have to sample 1,000 group 2 cells.

So according to the example above, group 2 was over-sampled. There were simply fewer transcripts in each group 2 cell, so we had to sample more cells to reach the same total number of transcripts for both groups.
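The bullet points above can be put into numbers with a small sketch (the per-cell transcript counts are hypothetical, chosen to match the example):

```python
# Hypothetical per-cell transcript counts: group 1 cells express all four
# genes, group 2 cells express only genes 3 and 4, at the same per-cell level.
cell1 = {"gene1": 1, "gene2": 1, "gene3": 1, "gene4": 1}
cell2 = {"gene1": 0, "gene2": 0, "gene3": 1, "gene4": 1}

def sample(cell, n_cells):
    """Total transcripts obtained by pooling n_cells identical cells."""
    return {g: c * n_cells for g, c in cell.items()}

print(sample(cell1, 500))   # 500 of each gene from 500 group 1 cells
print(sample(cell2, 1000))  # 1000 each of gene3/gene4 needs 1000 cells

# To match group 1's total of 2000 transcripts, group 2 must be over-sampled
# (1000 cells vs 500), because each group 2 cell carries fewer transcripts.
```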


Based on your example, can I assume the situation you mentioned is something different from the "sequencing depth" bias? So now I think we can have technical variability because of different sequencing depths between samples, and, moreover, technical variability because of sampling different numbers of cells (as you mentioned).


In essence, yeah.


Are you sure that your example is reformulating the example in the paper? Based on my understanding, the paper's example intends to say that the total number of RNA reads in one group should be twice that of the other group, so we should have a factor of 2. But in your example the total library size is 2000, the same for both groups. Sorry, I think I did not get what you mean.


Yes, my example more or less reformulates the thought experiment in the paper. The entire point is to not just use the raw library size as the normalization factor, since increased reads in one gene will decrease the reads in others.


Actually, based on the paper section you mentioned and the example, I more or less understand that we should not use only the raw library size as the normalization factor. But I do not understand why an increase in one gene should necessarily lead to a decrease in another. Other than that, I think I now understand what you mean.


The whole "more reads in gene 1 decreasing reads in other genes" thing is probably often not understood, so you're not alone. I wonder if this comes from the microarray days, when this wasn't the case.

This is due to the fact that reads are a finite resource. Let's take another example, where group1 is as above and group2 is identical to group1 but with gene1 upregulated 10x. Here, let us magically know that, in reality, the absolute numbers of transcript copies for gene2 through gene4 are identical between the groups. Because of the finite number of reads available, we might then expect something like the following counts:

       group1  group2
gene1  500     1538
gene2  500     154
gene3  500     154
gene4  500     154


It's because of this finite resource that increases in one gene can artificially decrease reads in other genes. Here again, accounting only for library size would be misleading. In most cases, the majority of genes won't be differentially expressed between samples, so you can exploit that fact in the normalization. Of course, this might not always hold, in which case spike-ins or other such approaches can be used.
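One way to see where those numbers come from is to derive them (a sketch; the 2000-read budget and the transcript copy numbers are made up to match the table above):

```python
# "True" transcript copy numbers: group2 is identical to group1 except
# gene1 is up-regulated 10x. Genes 2-4 are unchanged in absolute terms.
true1 = {"gene1": 500,  "gene2": 500, "gene3": 500, "gene4": 500}
true2 = {"gene1": 5000, "gene2": 500, "gene3": 500, "gene4": 500}

def expected_counts(truth, depth=2000):
    """Reads are a finite resource: each gene gets a share of the fixed
    sequencing depth proportional to its fraction of the transcript pool."""
    total = sum(truth.values())
    return {g: round(depth * c / total) for g, c in truth.items()}

print(expected_counts(true1))  # 500 for every gene
print(expected_counts(true2))  # gene1 ~1538, genes 2-4 ~154 each
```

Genes 2-4 did not change at all in absolute terms, yet their counts drop from 500 to ~154 simply because gene1 now claims a much larger share of the fixed read budget.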


Yes, and to restate it another way: sequencing counts are implicitly a relative measurement (whereas microarrays were, approximately, absolute).

The counts you see in an experiment are proportional to the number of (say) mRNA products of that gene, divided by the total number of mRNA products in the cell in that experiment. That fraction is multiplied by the total number of reads you got off the sequencer to get the observed count. So it's a ratio of concentrations, not an absolute concentration.

The implicit dependence on the total number of mRNA products in the cell means bad stuff can happen when that denominator changes, and motivates the development of these more sophisticated normalization methods.
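That relationship can be written out as a tiny sketch (hypothetical numbers; the formula is just depth times the gene's fraction of the mRNA pool, as described above):

```python
def observed_count(copies, total_mrna, depth):
    """Observed count = sequencing depth x (gene's share of the mRNA pool).
    A ratio of concentrations, not an absolute measurement."""
    return depth * copies / total_mrna

# Same absolute expression (500 copies) and same depth (2000 reads), but the
# total mRNA pool (the denominator) differs between the two conditions:
print(observed_count(500, 2000, 2000))  # 500.0
print(observed_count(500, 6500, 2000))  # ~153.8 -- looks "down-regulated"
```

Nothing about the gene itself changed between the two calls; only the denominator did, which is exactly the "bad stuff" that composition-aware normalization tries to undo.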


This paper, A scaling normalization method for differential expression analysis of RNA-seq data, seems to be the best reference for this topic specifically.


I was wondering if I could ask one more question, or whether that would be impolite. Just to be sure, does this mean that the sequencing machine may, in some sense, remove the "RNA composition" bias? In any case, many thanks, I learned a lot.