I know that's not right to ask this question here. But I'm feeling dumb understanding what this text says. Could someone please just simply explain to me that what this text, especially steps 1 to 4, means?
RNA-Seq reads were aligned to the UCSC hg19 transcriptome and genome using Tophat 2.56.
The number of reads aligning unambiguously to each gene in each sample was computed.
Normalization factors were computed in nonstandard manner to mitigate batch effects because two time points (resting and 5 days) were prepped and sequenced separately from the other two (1 day and 2 weeks).
1- First, ordinary normalization factors were computed using Trimmed Mean of M-values, and an intercept-only model was fit to the data using these normalization factors.
2- The genes were split into 100 bins by average counts per million reads, and the 20 genes with the lowest estimated biological coefficients of variation in that bin were selected, yielding a total of 2000 low variability genes distributed across the full range of expression values (approximately 10% of all expressed genes).
3- Trimmed Mean of M-values normalization factors were then computed using only these genes, and the resulting factors had a much smaller correlation with the two batches.
4- These normalization factors were used to normalize the full dataset for all further differential expression analysis and quantification.
Gene counts were analyzed for differential expression using the same model as for the peak analysis.
Gene expression levels for each sample were quantified as FPKM (fragments per kilobase per million fragments sequenced), using the length of the longest transcript isoform for each gene. Batch-corrected average gene expression levels for each condition were quantified by back-transforming the fitted model coefficients for each condition onto a raw count scale and then normalizing to FPKM as for the sample counts.
Cutoffs imposed for differential expression analysis included an FDR of 0.05 and log2 fold change >1 or < − 1.
I'm gonna do a simple DE analysis on this data to compare two models. Do I have to follow their pipeline? I can only normalize the data using TMM or other normalization methods. Is that enough?
Oh thank you so much. Your explanation was clear enough. This is the paper.
So to put it in a nutshell, since I'm gonna compare two models on detection of DE genes, these steps can be ignored and I just can use the raw read counts. Good news. Thanks a lot.
By the way, they have sequenced each two samples in one lane which can lead to GC content bias. Normalizing such bias can cause some difficulties. Is that ok not to normalize this bias?
In my experience lane bias is really over blown. Regarding GC bias between samples in particular, that's incredibly rare in practice these days.