Question

TMM normalization for data across two sequencing batches

3.0 years ago

wiscoyogi ▴ 40

I have a data normalization question for some RNAseq data that I'd like to apply CPM-TMM normalization to.

Say I have two sequencing batches
Batch 1: n = 9 biological replicates, with 2 technical replicates/sample + several other samples and respective technical replicates I do not care about
Batch 2: n = 20 biological replicates, with 5 technical replicates/sample + several other samples and respective technical replicates I do not care about

Batch 1 and Batch 2 were sequenced separately, but I only care about the n = 9 in Batch 1 and n = 20 in Batch 2 biological replicate samples for my downstream analysis

My plan was to pool the raw counts for the n = 9 and n = 20 biological replicates (plus their technical replicates, 118 samples in total) and then compute the TMM scaling factors on that joint data frame alone, since I'm only interested in comparing those 118 samples and none of the other samples from the two batches. Then I'd average the CPM-TMM counts for a given biological replicate across its technical replicates. Is this the right thing to do?

Or should I compute TMM scaling factors for Batch 1 and Batch 2 separately? My understanding is that TMM accounts for composition biases that depend on the biological condition of each sample, and that these biases could otherwise skew comparisons of expressed genes (like DE, though I'm not doing that), so it's best to compute the factors only on the samples you intend to compare. If so, computing TMM scaling factors for Batch 1 and Batch 2 separately would not be a good idea before comparing across them.

Is this normalization strategy reasonable (e.g., computing TMM scaling factors only for the 118 samples)? I want to make sure I'm understanding the TMM paper correctly.

tmm normalization rnaseq • 1.0k views


Typically you first sum (not average) technical replicates, only include samples you are interested in, and then normalize all these samples in one run. The aim of this normalization is to make samples comparable by adjusting for depth and composition, hence you must normalize data that go into the same analysis together. Adding samples you are not going to analyze is therefore not a good idea, if you ask me.
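To make the first two steps concrete, here is a minimal sketch in plain Python of dropping the samples you don't care about and summing raw counts across technical replicates. Summing keeps the values as integer counts and reflects the combined depth of each library, which averaging would not. The sample names, gene names, counts, and the helper `sum_technical_replicates` are all made up for illustration and are not from the original post.

```python
from collections import defaultdict

# Hypothetical toy data: raw counts keyed by (biological sample, technical replicate).
raw_counts = {
    ("b1_s1", "t1"): {"geneA": 10, "geneB": 0, "geneC": 90},
    ("b1_s1", "t2"): {"geneA": 14, "geneB": 1, "geneC": 85},
    ("b2_s1", "t1"): {"geneA": 3, "geneB": 5, "geneC": 42},
    ("b2_s1", "t2"): {"geneA": 2, "geneB": 7, "geneC": 41},
    ("other", "t1"): {"geneA": 99, "geneB": 99, "geneC": 99},  # not of interest
}

# Only these biological samples go into the downstream analysis.
samples_of_interest = {"b1_s1", "b2_s1"}


def sum_technical_replicates(counts, keep):
    """Sum raw counts over technical replicates, keeping only samples in `keep`."""
    summed = defaultdict(lambda: defaultdict(int))
    for (sample, _replicate), genes in counts.items():
        if sample not in keep:
            continue  # leave out samples that will not be analyzed
        for gene, n in genes.items():
            summed[sample][gene] += n
    return {sample: dict(genes) for sample, genes in summed.items()}


collapsed = sum_technical_replicates(raw_counts, samples_of_interest)

# Library size per biological sample after collapsing.
library_sizes = {sample: sum(genes.values()) for sample, genes in collapsed.items()}

print(collapsed)
print(library_sizes)
```

The collapsed matrix, with samples from both batches side by side, is what would then go through TMM in a single call. In edgeR that would presumably be `sumTechReps` for the summing step and `calcNormFactors` on the combined `DGEList`, but check the edgeR documentation for the exact usage.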

3.0 years ago by ATpoint 82k