Question

Normalization of Read counts for RNA-seq

0

Entering edit mode

7.5 years ago

gudraephouto ▴ 10

I know that's not right to ask this question here. But I'm feeling dumb understanding what this text says. Could someone please just simply explain to me that what this text, especially steps 1 to 4, means?

RNA-Seq reads were aligned to the UCSC hg19 transcriptome and genome using Tophat 2.56.

The number of reads aligning unambiguously to each gene in each sample was computed.

Normalization factors were computed in nonstandard manner to mitigate batch effects because two time points (resting and 5 days) were prepped and sequenced separately from the other two (1 day and 2 weeks).

1- First, ordinary normalization factors were computed using Trimmed Mean of M-values, and an intercept-only model was fit to the data using these normalization factors.

2- The genes were split into 100 bins by average counts per million reads, and the 20 genes with the lowest estimated biological coefficients of variation in that bin were selected, yielding a total of 2000 low variability genes distributed across the full range of expression values (approximately 10% of all expressed genes).

3- Trimmed Mean of M-values normalization factors were then computed using only these genes, and the resulting factors had a much smaller correlation with the two batches.

4- These normalization factors were used to normalize the full dataset for all further differential expression analysis and quantification.

Gene counts were analyzed for differential expression using the same model as for the peak analysis.

Gene expression levels for each sample were quantified as FPKM (fragments per kilobase per million fragments sequenced), using the length of the longest transcript isoform for each gene. Batch-corrected average gene expression levels for each condition were quantified by back-transforming the fitted model coefficients for each condition onto a raw count scale and then normalizing to FPKM as for the sample counts.

Cutoffs imposed for differential expression analysis included an FDR of 0.05 and log2 fold change >1 or < − 1.

I'm gonna do a simple DE analysis on this data to compare two models. Do I have to follow their pipeline? I can only normalize the data using TMM or other normalization methods. Is that enough?

Normalization • 8.8k views

ADD COMMENT • link updated 7.5 years ago by Devon Ryan 104k • written 7.5 years ago by gudraephouto ▴ 10

score 5 · Accepted Answer · 2016-10-06

5

Entering edit mode

7.5 years ago

Devon Ryan 104k

The weird normalization steps were only done because they didn't plan their experiment properly and had a batch effect that they couldn't properly compensate for. If you don't have an issue like that, completely ignore steps 1-4 and just use edgeR/DESeq2/limma/etc.

Anyway, I'll rewrite the steps in hopefully less opaque language.

We used edgeR for normalization. We then took the mean of the normalized counts.
We sorted by the mean and divided everything into 100 equally sized groups. We used edgeR to compute the "biological coefficient of variability" of each group and, within each of the 100 groups, selected the 20 genes with the lowest value (Devon's note: their goal here is to select genes that appear to be the least different between the groups. They hope that normalizing with these will correct the batch effect. This is likely incorrect though, so I hope they have other evidence for their results).
The counts for the genes from step 2 were put back into edgeR, it did normalization.
We applied the results from 3 back to the full dataset and pretended that we corrected for the batch effect.

Hope that helps.

ADD COMMENT • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

Oh thank you so much. Your explanation was clear enough. This is the paper.

So to put it in a nutshell, since I'm gonna compare two models on detection of DE genes, these steps can be ignored and I just can use the raw read counts. Good news. Thanks a lot.

ADD REPLY • link 7.5 years ago by gudraephouto ▴ 10

0

Entering edit mode

By the way, they have sequenced each two samples in one lane which can lead to GC content bias. Normalizing such bias can cause some difficulties. Is that ok not to normalize this bias?

ADD REPLY • link 7.5 years ago by gudraephouto ▴ 10

1

Entering edit mode

In my experience lane bias is really over blown. Regarding GC bias between samples in particular, that's incredibly rare in practice these days.

ADD REPLY • link 7.5 years ago by Devon Ryan 104k