Question: TMM normalization for RNA seq data with three replicates (2 samples)
3.5 years ago by
IrK30
Australia
IrK30 wrote:

Hi guys,

I'm working on RNA-seq data as well, where we have two samples (KO and WT) with three replicates each, so 6 libraries in total. We are looking at the coverage of reads at the 5' position. I would like to compare the coverage of specific genes in IGV, but before that I have to normalize the data. I am thinking of trying TMM normalization, but I am confused about how to treat the replicates.

The edgeR package:              calcNormFactors(object, method=('TMM').....)

How do I represent the object here? As a 6-column matrix? Or would you advise any other normalization method for this case?

Thank you,

 

tmm rna-seq edger replicates • 2.9k views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by IrK30
3.5 years ago by
IrK30
Australia
IrK30 wrote:

I couldn't find an example of TMM normalization with replicates yet.

However, I am wondering: if I compute CPM (counts per million) for all 3 replicates (for the two phenotypes, KO and WT) and then take the mean of these results, does it make statistical sense?

cpm_of_KO1, cpm_of_KO2, cpm_of_KO3
  -> mean(cpm_of_KO1, cpm_of_KO2, cpm_of_KO3)
  -> normalized count data for KO (same for WT)
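
The arithmetic asked about above can be sketched in base R as follows. The count matrix is made up for illustration (it is not the poster's data), and the names `counts`, `cpm_mat` and `mean_cpm` are hypothetical. The sketch only shows how the per-group mean CPM would be computed; it does not by itself settle whether averaging is the right summary for a given analysis.

```r
# Hypothetical raw counts: 4 genes x 6 libraries (3 KO, 3 WT)
counts <- matrix(c(10, 30,  60, 100,
                   20, 60, 120, 200,
                    5, 15,  30,  50,
                   40, 40,  60,  60,
                   80, 80, 120, 120,
                   20, 20,  30,  30),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4),
                                 c("KO1", "KO2", "KO3", "WT1", "WT2", "WT3")))
group <- rep(c("KO", "WT"), each = 3)

# CPM per library: divide each column by its library size, scale to one million
cpm_mat <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Mean CPM per gene within each group (one column per group)
mean_cpm <- sapply(c("KO", "WT"), function(g) {
  rowMeans(cpm_mat[, group == g, drop = FALSE])
})
mean_cpm
```

Because the CPM values are computed per library before averaging, differences in sequencing depth between replicates are removed first; composition differences between KO and WT are not.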
ADD COMMENTlink modified 24 months ago by h.mon26k • written 3.5 years ago by IrK30

Did you read the edgeR user guide? It has lots of examples, most (probably all) of them with replicates. TMM is the default method for the calcNormFactors function.

ADD REPLYlink written 3.5 years ago by h.mon26k
3.5 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

You should have 1 column per sample, for a total of 6 in your case. All examples you've likely ever seen of TMM have replicates, since the actual experimental design plays absolutely no role in TMM normalization.
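
As a minimal sketch of that layout, the six libraries can be bound into one matrix with one column per library, plus a matching group vector. The counts here are simulated and the object names (`counts`, `group`) are illustrative, not taken from the poster's files. The commented lines show where edgeR's `DGEList()` and `calcNormFactors()` would come in, assuming edgeR is installed.

```r
# Simulated raw counts: 100 features x 6 libraries (hypothetical data)
set.seed(1)
counts <- matrix(rpois(600, lambda = 20), nrow = 100, ncol = 6)
colnames(counts) <- c("KO1", "KO2", "KO3", "WT1", "WT2", "WT3")
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# One group label per column, in the same order as the columns
group <- factor(rep(c("KO", "WT"), each = 3))

# With edgeR loaded, the matrix and factor would be used like this:
#   y <- DGEList(counts = counts, group = group)
#   y <- calcNormFactors(y, method = "TMM")
dim(counts)
table(group)
```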
 

ADD COMMENTlink written 3.5 years ago by Devon Ryan91k
3.5 years ago by
IrK30
Australia
IrK30 wrote:

Thank you, Devon

 

Would you mind checking my pipeline for trimmed mean of M-values (TMM) normalization? I use the edgeR package. I also have a question about the cpm() option normalized.lib.sizes ("use normalized library sizes"): I can't understand what this means. If I have norm.factors, do I need to set normalized.lib.sizes to TRUE (please see my code below)?

Trimmed mean of M-values (TMM) normalization:

# 1) Build a matrix with raw count data as matrix (dim(matrix)=Num_rows*6)

sampl1_ko1=file1[,4]
sampl1_ko2=file2[,4]
sampl1_ko3=file3[,4]

sampl1_wt1=file1[,4]
sampl1_wt2=file2[,4]
sampl1_wt3=file3[,4]

minus= as.matrix ( cbind (sampl1_ko1, sampl1_ko2, sampl1_ko3, sampl1_wt1,sampl1_wt2,sampl1_wt3  ))

# 2) Specify amount of replicates and samples
group=c(rep('KO',3),rep('WT',3))

# 3) Create the DGEList object
count_tbl<- DGEList(counts = minus, group=group)
count_tbl_norm_factors<- calcNormFactors (count_tbl, method=c('TMM'))

 group lib.size norm.factors
cov1_mn_ko    KO   480979    0.5490111
cov2_mn_ko    KO   465474    0.4833874
cov3_mn_ko    KO   619070    0.3399332
cov1_mn_wt    WT    92693    2.2296840
cov2_mn_wt    WT    92693    2.2296840
cov3_mn_wt    WT    92693    2.2296840
# 4) Normalize count table
c=cpm(count_tbl_norm_factors, normalized.lib.sizes=TRUE)
ADD COMMENTlink written 3.5 years ago by IrK30
3.5 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

How are the KO and WT counts coming from the same files (file1[,4] is making both sampl1_ko1 and sampl1_wt1)? Also, it's unusual that all of the WT samples have the exact same lib.size and norm.factors. Aside from that, it'd be simpler to:

count_tbl <- DGEList(counts = minus, group=group)
count_tbl <- calcNormFactors(count_tbl, method=c('TMM'))

Regarding cpm(), "normalized.lib.sizes=TRUE" is actually the default. The confusion here is probably due to how that's named. If you set that to false, then the cpm will be calculated using "library size normalization", meaning the results are:

counts/(1e-6 * lib.size) # lib.size is 480979, 465474, 619070, 92693 ...

This uses the non-robust "library size normalization", which is not preferred. If you use the default settings then the norm.factors will get incorporated and you'll get more useful results (the edgeR authors are pretty good about choosing appropriate defaults for everything).
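
To make the difference concrete, here is a small base-R sketch of the two calculations. With normalized.lib.sizes=TRUE, edgeR's cpm() divides by the effective library size, lib.size * norm.factors, rather than by lib.size alone. The counts and the factors below are made up for illustration (real factors would come from calcNormFactors), and the object names are hypothetical.

```r
# Hypothetical toy counts: 3 genes x 2 libraries (not data from this thread)
counts <- matrix(c(10, 20,  70,
                   40, 60, 100),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("lib1", "lib2")))
lib.size <- colSums(counts)                   # 100 and 200
norm.factors <- c(lib1 = 0.8, lib2 = 1.25)    # made-up factors for illustration

# normalized.lib.sizes = FALSE: plain library-size scaling
cpm_plain <- t(t(counts) / (1e-6 * lib.size))

# normalized.lib.sizes = TRUE: scale by the effective library size instead
eff.lib.size <- lib.size * norm.factors
cpm_tmm <- t(t(counts) / (1e-6 * eff.lib.size))

colSums(cpm_plain)  # each column sums to one million
colSums(cpm_tmm)    # columns no longer sum to one million
```

A library with a norm factor below 1 gets a smaller effective size, so its CPM values are scaled up relative to the plain calculation, and the reverse holds for a factor above 1.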

ADD COMMENTlink written 3.5 years ago by Devon Ryan91k
3.5 years ago by
IrK30
Australia
IrK30 wrote:

Thank you very much for your response and explanation, I appreciate it very much.

You are right, I made a mistake with the file names in this post; WT and KO come from different files. As for the identical WT values, that was my mistake: I was reading the same file three times. :)))

Now it's perfect:

group lib.size norm.factors
cov1_mn_ko    KO   480979    0.5987647
cov2_mn_ko    KO   465474    0.6061697
cov3_mn_ko    KO   619070    0.4162593
cov1_mn_wt    WT   108708    1.6843754
cov2_mn_wt    WT   122504    1.9098760
cov3_mn_wt    WT    92693    2.0575083

 

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by IrK30
3.5 years ago by
IrK30
Australia
IrK30 wrote:

I would also like to clarify the following issue: 

I am using the edgeR package to normalize my count data with TMM normalization. As a result I expect to have a table of normalized counts; I am not looking for DE at the moment. Once I run calcNormFactors (this function finds a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes [https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html]), how can I see the normalized count table? Would it be correct to extract the normalized counts by passing the object that holds the norm.factors to cpm()?

c=cpm(count_tbl_norm_factors, normalized.lib.sizes=TRUE)

I am confused about how to get a TMM-normalized count table after this step, as the manual says: "In edgeR, a pseudo-count is a type of normalized count, however users are advised not to interpret the pseudo-counts as general-purpose normalized counts".

The same question was not answered in the post: How To Export Normalized Counts From Edger

 

P.S.: my goal is to compare CPM normalization with TMM normalization and select the one that performs best. However, I am a bit stuck with TMM.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by IrK30

I don't know that there's an "approved" way of getting normalized counts from edgeR. Realistically, you could probably just multiply the cpm by a million.

ADD REPLYlink written 3.5 years ago by Devon Ryan91k
3.5 years ago by
IrK30
Australia
IrK30 wrote:

I found CPM as:

cpm=(counts*10^6)/unique_aligned_reads     # Do you mean like this? 

Another question: which other good techniques can I use to normalize a raw count table of RNA-seq data? I read that there are three good ones: RPKM (in my case I use CPM, because of the given data); TMM, which can't produce a normalized count table because it is an intermediate step in the edgeR package; and upper-quartile, which I heard is not as good as presented.

 

 

ADD COMMENTlink written 3.5 years ago by IrK30

The three good methods are TMM, RLE (what's used in DESeq2) and quantile normalization. CPM/RPKM/FPKM aren't normally used for statistics, just visualization.
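
For readers unfamiliar with RLE, the median-of-ratios idea behind it can be sketched in base R as below. This is a simplified illustration on made-up counts with no zeros. The actual edgeR and DESeq2 implementations add handling that is not shown here (for example, genes with zero counts), and the names `log_ref`, `size_factors` and `normalized` are hypothetical.

```r
# Hypothetical counts: 4 genes x 3 libraries, no zeros.
# lib2 is lib1 at twice the depth; lib3 matches lib1 except for one changed gene.
counts <- cbind(lib1 = c(10, 20, 30,  40),
                lib2 = c(20, 40, 60,  80),
                lib3 = c(10, 20, 30, 400))

# Per-gene log geometric mean across libraries (a pseudo-reference sample)
log_ref <- rowMeans(log(counts))

# Size factor per library: median ratio of its counts to the reference
size_factors <- apply(counts, 2, function(x) exp(median(log(x) - log_ref)))

# Dividing each column by its size factor puts the libraries on a common scale
normalized <- sweep(counts, 2, size_factors, "/")
size_factors
```

Because the median is used, the single strongly changed gene in lib3 does not shift that library's size factor, which is the robustness that plain library-size scaling lacks.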

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Devon Ryan91k
3.5 years ago by
IrK30
Australia
IrK30 wrote:

Thank you, I will try RLE then.

ADD COMMENTlink written 3.5 years ago by IrK30