Question

TPM normalization factors in RNA-seq

1

Entering edit mode

5.0 years ago

a.rex ▴ 350

I have perhaps a naive question:

In RNA-seq with Kallisto, est_counts are generated and TPMs per transcripts can be calculated. However, to compare between sample, sleuth is used to calculate between sample normalization factors. Therefore, TPM for the individual libraries need to be normalized to be scaled to be compared.

In ATAC-seq, I have seem papers that generate coverage tracks that are normalized to RPKM (given a set bin size). This is sufficient to compare between samples. Why is there no generation of a scale factor in this case?

rna-seq • 2.5k views

ADD COMMENT • link 5.0 years ago by a.rex ▴ 350

1

Entering edit mode

Literature is full of flawed analysis. RPKM might be acceptable if you do not expect any large changes in library composition and only depth normalization is required. Still, as you should perform any differential analysis with a proper statistical framework, none I know accepts RPKM but raw counts as input which will then be used to calculate proper scaling/size factors. I suggest you perform your differential analysis with the established tools. For ATAC-seq one could use either DESeq2, edgeR or csaw (which uses edgeR internally). All of them have elaborate normalization strategies that will generate size factors. Go through the manuals of the tools and choose the one you feel comfortable with. The typical workflow I do is:

call peaks on the pooled BAM files of all condiitions with macs2 using --call-summits option
filter against the ENCODE and mitochondrial homolog blacklist (if this exists for your species)
extend peak summits by 500bp and merge peaks in case of overlaps (500bp I find reasonable to on the one hand reduce the size of larger peaks of adjacent open chromatin regions while still respecting the underlying biology with a nucleosome-free region in the center of the peak being somewhat > 147bp as 147bp is the number of bp wrapped around a nucleosome if it was present, plus then two flanking nucleosomes. This is then essentially 3*147bp = 441bp rounded to 500bp and should capture most of the ATAC-seq signal for a given region while not inflating peak size too much. Others may have different opinions on this. csaw for example suggests using sliding windows to completely avoid peak calling for differential )
count reads per peak with countRegions from csaw
normalize with the TMM method on the peak counts as suggested in the csaw manual
feed this into the standard edgeR workflow again as suggested in the csaw manual

ADD REPLY • link 5.0 years ago by ATpoint 82k