TMM normalization for RNA seq data with three replicates (2 samples)
5.9 years ago
IrK ▴ 70

Hi guys,

I'm working on RNA-seq data where we have two samples (KO and WT) with three replicates each, so six libraries in total. We are looking at read coverage at the 5' position. I would like to compare the coverage of specific genes in IGV, but before that I have to normalize the data. I am thinking of trying TMM normalization, but I am confused about how to treat the replicates.

The edgeR package:

calcNormFactors(object, method=('TMM').....)


How do I represent the object here? As a 6-column matrix? Or would you advise any other normalization method for this case?

Thank you,

edgeR RNA-Seq replicates TMM • 4.9k views
5.9 years ago
IrK ▴ 70

I couldn't find an example of TMM normalization with replicates yet.

However, I am wondering: if I compute CPM (counts per million) for all 3 replicates (for the two phenotypes, KO and WT) and then take the mean of these results, does that make statistical sense?

cpm_of_KO1, cpm_of_KO2, cpm_of_KO3
-> mean_of_(cpm_of_KO1, cpm_of_KO2, cpm_of_KO3)
-> normalized count data for KO (same for WT)
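
The averaging described above can be sketched in base R. The matrix values and the names `cpm_mat` and `mean_cpm` are made up for illustration. This only shows the arithmetic and does not settle whether averaging replicates is appropriate for a given comparison.

```r
# Toy CPM matrix: 3 genes x 6 libraries (values invented for illustration).
cpm_mat <- matrix(
  c( 10, 12,  14, 30, 28, 32,
      5,  5,   5,  5,  5,  5,
    100, 90, 110, 50, 60, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3),
                  c("KO1", "KO2", "KO3", "WT1", "WT2", "WT3"))
)
group <- factor(c(rep("KO", 3), rep("WT", 3)))

# Average the replicate columns within each group, gene by gene.
mean_cpm <- sapply(levels(group), function(g) {
  rowMeans(cpm_mat[, group == g, drop = FALSE])
})
print(mean_cpm)
```

Keeping the per-replicate columns alongside the group means makes it easier to see how much the replicates disagree before collapsing them.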


Did you read the edgeR user guide? It has lots of examples, most (probably all) of them with replicates. TMM is the default method for the calcNormFactors function.

5.9 years ago

You should have 1 column per sample, for a total of 6 in your case. All examples you've likely ever seen of TMM have replicates, since the actual experimental design plays absolutely no role in TMM normalization.


Thank you, Devon

Would you mind checking my pipeline for trimmed mean of M-values (TMM) normalization? I use the edgeR package. I also have a question about the cpm() option normalized.lib.sizes (use normalized library sizes): I can't understand what this means, so if I have norm.factors, do I need to set normalized.lib.sizes to TRUE (please see my code below)?

Trimmed mean of M (TMM) normalization:

1. Build a matrix of raw counts (dim(matrix) = Num_rows x 6)

sampl1_ko1=file1[,4]
sampl1_ko2=file2[,4]
sampl1_ko3=file3[,4]

sampl1_wt1=file1[,4]
sampl1_wt2=file2[,4]
sampl1_wt3=file3[,4]

minus= as.matrix ( cbind (sampl1_ko1, sampl1_ko2, sampl1_ko3, sampl1_wt1,sampl1_wt2,sampl1_wt3 ))

2. Specify the group (KO or WT) of each replicate

group=c(rep('KO',3),rep('WT',3))

3. Create the DGEList object

count_tbl<- DGEList(counts = minus, group=group)
count_tbl_norm_factors<- calcNormFactors (count_tbl, method=c('TMM'))


           group lib.size norm.factors
cov1_mn_ko    KO   480979    0.5490111
cov2_mn_ko    KO   465474    0.4833874
cov3_mn_ko    KO   619070    0.3399332
cov1_mn_wt    WT    92693    2.2296840
cov2_mn_wt    WT    92693    2.2296840
cov3_mn_wt    WT    92693    2.2296840

4. Normalize count table

c=cpm(count_tbl_norm_factors, normalized.lib.sizes=TRUE)


How are the KO and WT counts coming from the same files (file1[,4] is making both sampl1_ko1 and sampl1_wt1)? Also, it's unusual that all of the WT samples have the exact same lib.size and norm.factors. Aside from that, it'd be simpler to:

count_tbl <- DGEList(counts = minus, group=group)
count_tbl <- calcNormFactors(count_tbl, method=c('TMM'))


Regarding cpm(), normalized.lib.sizes=TRUE is actually the default. The confusion here is probably due to how that's named. If you set that to FALSE, then the cpm will be calculated using plain "library size normalization", meaning the results are:

counts/(1e-6 * lib.size) # lib.size is 480979, 465474, 619070, 92693 ...


This uses the non-robust "library size normalization", which is not preferred. If you use the default settings then the norm.factors will get incorporated and you'll get more useful results (the edgeR authors are pretty good about choosing appropriate defaults for everything).
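
To make the difference concrete, here is a rough base-R sketch of the two calculations. The counts and the `norm.factors` values are invented, and the variable names are just for illustration. As far as I know, edgeR's cpm() with normalized.lib.sizes=TRUE applies the same scaling, dividing by lib.size * norm.factors (the "effective library size").

```r
# Toy raw counts: 3 genes x 2 libraries (invented values).
counts <- matrix(c(10,  20,
                   30,  60,
                   60, 120),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), c("KO1", "WT1")))

lib.size     <- colSums(counts)        # raw library sizes: 100 and 200
norm.factors <- c(KO1 = 0.5, WT1 = 2)  # hypothetical TMM factors, not computed here

# normalized.lib.sizes = FALSE: scale by the raw library size only.
cpm_raw <- t(t(counts) / lib.size) * 1e6

# normalized.lib.sizes = TRUE: scale by the effective library size.
eff.lib.size <- lib.size * norm.factors
cpm_tmm <- t(t(counts) / eff.lib.size) * 1e6

print(cpm_raw)
print(cpm_tmm)
```

In this toy case the raw CPM values are identical between the two libraries, while the TMM-adjusted ones differ, which is exactly the effect of the norm.factors.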


Thank you very much for your response and explanation, I appreciate it very much.

You are right, I made a mistake with the file names in this post; WT and KO come from different files. As for the identical WT values, that was my mistake: I was reading the same file three times. :)))

Now it's perfect:

           group lib.size norm.factors
cov1_mn_ko    KO   480979    0.5987647
cov2_mn_ko    KO   465474    0.6061697
cov3_mn_ko    KO   619070    0.4162593
cov1_mn_wt    WT   108708    1.6843754
cov2_mn_wt    WT   122504    1.9098760
cov3_mn_wt    WT    92693    2.0575083


I would also like to clarify the following issue:

I am using the edgeR package to normalize my count data with TMM. As a result I expect to have a table of normalized counts; I am not looking for DE at the moment. Once I run calcNormFactors (this function finds a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes [See this link]), how can I see the normalized count table? Would it be correct to extract the normalized counts by passing the norm.factors to cpm()?

c=cpm(count_tbl_norm_factors, normalized.lib.sizes=TRUE)


I am confused about how to get a TMM-normalized count table after this step, as the manual says: "In edgeR, a pseudo-count is a type of normalized count, however users are advised not to interpret the pseudo-counts as general-purpose normalized counts."

The same question was not answered in the post: How To Export Normalized Counts From Edger

P.S. My goal is to compare CPM normalization with TMM and pick whichever performs better. However, I am a bit stuck with TMM.


I don't know that there's an "approved" way of getting normalized counts from edgeR. Realistically, you could probably just multiply the cpm by a million.


I found CPM as:

cpm=(counts*10^6)/unique_aligned_reads     # Do you mean like this?


Another question: which other good techniques can I use to normalize a raw count table of RNA-seq data? I read that there are three good ones: RPKM (in my case I use CPM because of the given data); TMM, which can't produce a normalized count table because it's an intermediate step of the edgeR package; and upper-quartile, which I heard is not as good as presented.


The three good methods are TMM, RLE (what's used in DESeq2) and quantile normalization. CPM/RPKM/FPKM aren't normally used for statistics, just visualization.
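
For a feel of what RLE does, here is a hand-rolled base-R sketch of the median-of-ratios idea (the DESeq-style size factor). The counts are invented and this is not the DESeq2 or edgeR code itself. In edgeR the method is available as calcNormFactors(..., method="RLE"), which, as far as I know, reports factors relative to the library size rather than the raw size factors computed here.

```r
# Toy raw counts: 4 genes x 3 libraries (invented values).
# Library B is library A sequenced twice as deep; gene4 is changed in C.
counts <- matrix(c( 10,  20,  10,
                    50, 100,  50,
                   200, 400, 200,
                     8,  16,  40),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("A", "B", "C")))

# Log geometric mean of each gene across libraries,
# using only genes without zero counts.
keep    <- rowSums(counts == 0) == 0
log_geo <- rowMeans(log(counts[keep, , drop = FALSE]))

# Size factor per library: median over genes of count / geometric mean.
size_factors <- apply(counts[keep, , drop = FALSE], 2, function(x) {
  exp(median(log(x) - log_geo))
})

# Divide each library by its size factor to put counts on a common scale.
norm_counts <- t(t(counts) / size_factors)

print(size_factors)
print(norm_counts)
```

Because the median is taken over genes, the single changed gene (gene4 in library C) does not drag the size factor of C away from that of A.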


Thank you, I will try RLE then.