Question: Integrating RNA-Seq datasets from different experiments
lessismore (Mexico) wrote, 19 months ago:

Hey all,

I am integrating RNA-seq datasets, which I mapped with kallisto against the reference genome. I now have TPM values for all of them, filtered to keep only expressed genes.

My question is: when integrating the different datasets (belonging to different experiments), would you further normalize the whole dataset (e.g. log2 transform it and quantile normalize, or apply TMM, etc.), or would you go directly to the batch effect correction?
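To make the first option concrete, here is a minimal sketch of a log2 transform followed by quantile normalization on a genes x samples matrix. The function names, the pseudocount of 1 and the toy TPM values are assumptions made for illustration; ties are broken by gene order, which dedicated implementations handle more carefully.

```python
import math


def log2_transform(matrix, pseudocount=1.0):
    """Return log2(value + pseudocount) for every entry (genes x samples)."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Force every sample (column) to share the same distribution.

    Each value is replaced by the mean, across samples, of the values
    that hold the same rank. Ties are broken by gene order (a simplification).
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_columns = [sorted(col) for col in columns]
    # Mean of the r-th smallest value over all samples.
    rank_means = [
        sum(col[r] for col in sorted_columns) / n_samples for r in range(n_genes)
    ]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result


# Hypothetical TPM values: 3 genes (rows) x 3 samples (columns).
tpm = [
    [0.0, 3.0, 1.0],
    [7.0, 15.0, 3.0],
    [1.0, 0.0, 7.0],
]
logged = log2_transform(tpm)
normalized = quantile_normalize(logged)
```

After this step every sample has exactly the same set of values, which is what "making the distributions uniform" means here; it does nothing about batch by itself.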

Thanks in advance

Tags: rna-seq normalization
Kevin Blighe (Guy's Hospital, London) wrote, 19 months ago:

Hey,

I would do neither of the above. I would import the kallisto estimated counts from all samples into DESeq2 using tximport, and then include batch as a factor in the DESeq2 design formula. For a tutorial from Michael Love and colleagues, take a look here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

I would not advise attempting to directly correct for batch on your raw counts or TPM values. It is better to include batch as a covariate or blocking factor in your statistical models. This is recommended HERE and HERE by statisticians working in the field of expression data normalisation and batch correction. Others have other opinions though, as always.

Good luck!

Kevin


Hey @Kevin,

Thank you for your answer. I've read these papers. My final aim is network analysis, so once I import the raw counts into DESeq2 I don't want to fit any model there: my idea is to use the batch-corrected dataset as input for another program for network building. That's why I preferred to log2 transform the TPM values to make the data easier to handle, quantile normalize them to make the distributions uniform, and correct for batch effects to remove the unwanted variation coming from the several experiments (users, dates, etc.). Then I'll have the input I want for the following analysis. What do you think?
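As a rough illustration of what "correcting for batch effects" on log-scale values can mean in its simplest form, the sketch below shifts each batch so that its per-gene mean matches the overall mean. This is a naive location-only adjustment written for illustration; it is not ComBat and not what DESeq2 does, and the function names and toy values are assumptions. It also ignores the biological condition, so in an unbalanced design it would remove real signal along with the batch shift.

```python
def center_batches(values, batches):
    """Shift each batch of one gene's log-expression to the overall mean.

    values:  log-scale expression of a single gene, one entry per sample
    batches: batch label of each sample, in the same order
    """
    overall_mean = sum(values) / len(values)
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1
    batch_means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [
        value - batch_means[batch] + overall_mean
        for value, batch in zip(values, batches)
    ]


def center_matrix(matrix, batches):
    """Apply the per-gene adjustment to a genes x samples matrix."""
    return [center_batches(row, batches) for row in matrix]


# Hypothetical gene measured in two experiments, with a shift of about 4 log2 units.
gene = [5.0, 6.0, 9.0, 10.0]
experiment = ["a", "a", "b", "b"]
adjusted = center_batches(gene, experiment)
```

In this toy case the within-batch differences (one log2 unit) survive while the shift between experiments disappears.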

(written 19 months ago by lessismore; modified 19 months ago)

Hey, are you aiming to use WGCNA for network analysis, or something else?

From the DESeq2 objects, it's possible to extract raw, normalised, variance-stabilised, and regularised log-transformed counts, which should be sufficient(?). Including batch in the design formula adjusts the model for batch, but note that it does not by itself remove batch from the extracted counts, so it is worth checking them (e.g. by PCA) before network analysis.

Edit 18th June 2018:

If including batch as a covariate in the design formula, and extracting transformed counts for downstream analysis like WGCNA, ensure that blind=FALSE is set when using the vst() or rlog() functions, so that the design is used during the transformation.

(written 19 months ago by Kevin Blighe; modified 11 months ago)

Hey Kevin, again very helpful. I was thinking of WGCNA. On their website they suggest correcting with ComBat, which is why your advice changes my plans.

(written 19 months ago by lessismore; modified 19 months ago)

Yes, Steve Horvath worked in the lab where I was based in Boston, and they use WGCNA extensively there. The published manuscripts on batch correction (which we've both read) state that ComBat and other similar methods are fine if the dataset is balanced.
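Whether a dataset is "balanced" can be checked from the sample sheet alone. The sketch below uses one simple working definition, assumed here for illustration: every batch contains the biological conditions in the same proportions. The function names and the sample labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def cross_tabulate(batches, conditions):
    """Count samples for every (batch, condition) pair."""
    return Counter(zip(batches, conditions))


def is_balanced(batches, conditions):
    """True if each batch holds the conditions in identical proportions."""
    counts = cross_tabulate(batches, conditions)
    condition_levels = sorted(set(conditions))
    profiles = []
    for batch in sorted(set(batches)):
        total = sum(counts[(batch, c)] for c in condition_levels)
        # Exact fractions avoid any floating-point comparison issues.
        profiles.append(
            tuple(Fraction(counts[(batch, c)], total) for c in condition_levels)
        )
    return all(profile == profiles[0] for profile in profiles)


# Hypothetical sample sheets.
batch = ["exp1"] * 4 + ["exp2"] * 4
balanced_design = ["ctrl", "ctrl", "treated", "treated"] * 2
confounded_design = ["ctrl"] * 4 + ["treated"] * 4

print(is_balanced(batch, balanced_design))    # each experiment has both conditions
print(is_balanced(batch, confounded_design))  # condition is fully tied to experiment
```

In the second design, batch and condition cannot be separated at all, which is the situation where direct batch correction is most risky.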

I guess that what you should do is first see if there is indeed any batch effect. You can do PCA to visually check if the samples segregate based on sampling date, batch, etc. You can also correlate these parameters to the first 5 or 10 PC eigenvectors to see if any significant correlations exist (use cor.test() in R I think - String-based factors will have to be converted to numerical factors).
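The correlation check can be sketched as follows. The PC scores are taken as given (hypothetical numbers standing in for the output of a PCA), the string factor is converted to integer codes, and a permutation p-value is used in place of the t-test that cor.test() performs in R, since that swap keeps the example dependency-free. All names and values here are assumptions for illustration. Integer coding is only really meaningful for two-level or ordered factors.

```python
import math
import random


def encode_factor(labels):
    """Convert string labels to integer codes (alphabetical level order)."""
    levels = sorted(set(labels))
    index = {level: i for i, level in enumerate(levels)}
    return [index[label] for label in labels]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled labels correlate as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical PC scores for 8 samples from two experiments.
batch_labels = ["exp1"] * 4 + ["exp2"] * 4
pc1 = [-2.1, -1.8, -2.4, -1.9, 2.0, 2.2, 1.7, 2.3]
pc2 = [0.3, -0.5, 0.4, -0.2, -0.4, 0.5, -0.3, 0.2]

batch_codes = encode_factor(batch_labels)
r_pc1 = pearson(pc1, batch_codes)
r_pc2 = pearson(pc2, batch_codes)
p_pc1 = permutation_p_value(pc1, batch_codes)
```

Here PC1 tracks the experiment almost perfectly while PC2 does not, which is the pattern one would read as a batch effect on the first component.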

(written 19 months ago by Kevin Blighe; modified 6 months ago)

I guess that what you should do is first see if there is indeed any batch effect. You can do PCA to visually check if the samples segregate based on sampling date, batch, etc.

Hey Kevin, yes, that is indeed what I did, and I observed a batch effect, though not a strong one. The TPM normalization should have reduced it.

You can also correlate these parameters to the first 5 or 10 PC eigenvectors to see if any significant correlations exist (use cor.test() in R I think - String-based factors will have to be converted to numerical factors).

I'll let you know! Thanks again, always extremely helpful :)

(written 19 months ago by lessismore; modified 6 months ago by Kevin Blighe)