Question: Normalization of Read counts for RNA-seq
gudraephouto10 wrote (2.7 years ago):

I know this may not be the right place to ask, but I feel dumb for not understanding what this text says. Could someone please explain simply what it means, especially steps 1 to 4?

RNA-Seq reads were aligned to the UCSC hg19 transcriptome and genome using Tophat 2.56.

The number of reads aligning unambiguously to each gene in each sample was computed.

Normalization factors were computed in a nonstandard manner to mitigate batch effects, because two time points (resting and 5 days) were prepped and sequenced separately from the other two (1 day and 2 weeks).

1- First, ordinary normalization factors were computed using Trimmed Mean of M-values, and an intercept-only model was fit to the data using these normalization factors.

2- The genes were split into 100 bins by average counts per million reads, and the 20 genes with the lowest estimated biological coefficients of variation in that bin were selected, yielding a total of 2000 low variability genes distributed across the full range of expression values (approximately 10% of all expressed genes).

3- Trimmed Mean of M-values normalization factors were then computed using only these genes, and the resulting factors had a much smaller correlation with the two batches.

4- These normalization factors were used to normalize the full dataset for all further differential expression analysis and quantification.

Gene counts were analyzed for differential expression using the same model as for the peak analysis.

Gene expression levels for each sample were quantified as FPKM (fragments per kilobase per million fragments sequenced), using the length of the longest transcript isoform for each gene. Batch-corrected average gene expression levels for each condition were quantified by back-transforming the fitted model coefficients for each condition onto a raw count scale and then normalizing to FPKM as for the sample counts.
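As a side note, the FPKM quantification described above is simple to reproduce. Here is a minimal sketch with made-up toy numbers (not the paper's data):

```python
import numpy as np

def fpkm(counts, gene_lengths_bp):
    """Fragments Per Kilobase of transcript per Million fragments sequenced.

    counts          : per-gene fragment counts for one sample
    gene_lengths_bp : length of the longest transcript isoform per gene, in bp
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3
    millions_sequenced = counts.sum() / 1e6
    return counts / lengths_kb / millions_sequenced

# toy example: 3 genes, 1000 fragments total
vals = fpkm([100, 400, 500], [1000, 2000, 500])
```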

Cutoffs imposed for differential expression analysis included an FDR of 0.05 and log2 fold change >1 or < − 1.
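Those cutoffs amount to a simple boolean filter on the per-gene results; a toy sketch (hypothetical result arrays, not the paper's data):

```python
import numpy as np

# hypothetical per-gene results: FDR-adjusted p-values and log2 fold changes
fdr = np.array([0.01, 0.20, 0.04, 0.003])
log2fc = np.array([1.5, 2.0, 0.3, -1.8])

# significant = FDR < 0.05 AND (log2 fold change > 1 OR < -1)
significant = (fdr < 0.05) & (np.abs(log2fc) > 1)
```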

I'm going to do a simple DE analysis on this data to compare two models. Do I have to follow their pipeline, or can I just normalize the data using TMM or another normalization method? Is that enough?

modified 2.7 years ago by Devon Ryan90k • written 2.7 years ago by gudraephouto10
Devon Ryan90k (Freiburg, Germany) wrote (2.7 years ago):

The weird normalization steps were only done because they didn't plan their experiment properly and had a batch effect that they couldn't properly compensate for. If you don't have an issue like that, completely ignore steps 1-4 and just use edgeR/DESeq2/limma/etc.

Anyway, I'll rewrite the steps in hopefully less opaque language.

  1. We used edgeR for normalization. We then took the mean of the normalized counts.
  2. We sorted by the mean and divided everything into 100 equally sized groups. We used edgeR to compute the "biological coefficient of variability" of each group and, within each of the 100 groups, selected the 20 genes with the lowest value (Devon's note: their goal here is to select genes that appear to be the least different between the groups. They hope that normalizing with these will correct the batch effect. This is likely incorrect though, so I hope they have other evidence for their results).
  3. The counts for the genes from step 2 were put back into edgeR, which recomputed the normalization factors using only them.
  4. We applied the results from 3 back to the full dataset and pretended that we corrected for the batch effect.
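The bin-and-select procedure in steps 1–3 can be sketched roughly as follows. This is a toy Python illustration using a plain coefficient of variation in place of edgeR's biological coefficient of variation, and library-size/median-ratio scaling in place of TMM; the actual pipeline would use edgeR in R:

```python
import numpy as np

def select_low_variability_genes(counts, n_bins=100, per_bin=20):
    """Pick the least-variable genes within expression bins.

    counts : (genes, samples) raw count matrix.
    Stand-ins for the paper's method: simple CPM and coefficient of
    variation instead of edgeR's TMM-normalized CPM and BCV.
    """
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=0) * 1e6            # counts per million, per sample
    mean_cpm = cpm.mean(axis=1)
    cv = cpm.std(axis=1) / np.maximum(mean_cpm, 1e-9)  # variability proxy

    order = np.argsort(mean_cpm)                       # sort genes by expression
    bins = np.array_split(order, n_bins)               # ~equal-sized bins
    keep = []
    for b in bins:
        keep.extend(b[np.argsort(cv[b])[:per_bin]])    # lowest-CV genes per bin
    return np.sort(np.array(keep))

def norm_factors(counts, genes):
    """Per-sample scaling factors from the selected genes only (step 3).

    Median-of-ratios against a geometric-mean reference, as a simple
    stand-in for TMM.
    """
    sub = np.asarray(counts, dtype=float)[genes]
    ref = np.exp(np.log(sub + 0.5).mean(axis=1))       # geometric-mean reference
    return np.median(sub / ref[:, None], axis=0)
```

With 100 bins and 20 genes per bin this yields the paper's 2000 low-variability genes spread across the expression range, whose factors are then applied to the full dataset (step 4).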

Hope that helps.

written 2.7 years ago by Devon Ryan90k

Oh thank you so much. Your explanation was clear enough. This is the paper.

So, in a nutshell, since I'm going to compare two models on detection of DE genes, these steps can be ignored and I can just use the raw read counts. Good news. Thanks a lot.

written 2.7 years ago by gudraephouto10

By the way, they sequenced two samples per lane, which can lead to GC-content bias. Normalizing away such a bias can cause some difficulties. Is it OK not to normalize for this bias?

written 2.7 years ago by gudraephouto10

In my experience lane bias is really overblown. Regarding GC bias between samples in particular, that's incredibly rare in practice these days.

written 2.7 years ago by Devon Ryan90k
Powered by Biostar version 2.3.0