Question: Dealing with batch effect / donor variation in RNA-seq data
6.4 years ago by
European Union
sbuschow10 wrote:

Hi all,

I am struggling with a statistical question related to RNA seq data.

I have RNA data on 4 different cell types, with each set of 4 collected from the same person. I have such cell sets/batches from 3 individuals. Each batch of 4 cell types was prepared for RNA sequencing separately.

My goals are 1) to identify the relation between these cell types (which ones are most comparable and which ones are more different) and 2) to find genes differentially expressed between the cell types.

After receiving the sequencing data back, I noticed a clear batch/donor effect between the 3 sets of samples.

Standard normalisation procedures were not effective, as the batch effect was different for genes with high read counts than for those with low read counts. Samples clustered according to donor, not cell type.

What does work pretty nicely is to divide the expression values for each sample by the mean expression over all 4 cell types from that donor (i.e. scale out the batch difference per gene) and then use the scaled values, after log transformation, as input for clustering. After doing this for each donor separately, I get a nice clustering according to cell type.
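The per-donor scaling described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the nested-dict layout, the function names `scale_by_donor_mean` and `pearson`, the `pseudocount` parameter, and the toy numbers are all made up for the example and are not from the original post.

```python
import math


def scale_by_donor_mean(expr, pseudocount=1.0):
    """Return log2(value / donor mean) per gene.

    expr: {donor: {cell_type: [expression value per gene]}} (hypothetical layout).
    pseudocount: added to numerator and denominator to avoid log(0) (an assumption).
    """
    scaled = {}
    for donor, samples in expr.items():
        n_genes = len(next(iter(samples.values())))
        # Per-gene mean over all cell types of this donor.
        donor_mean = [
            sum(values[g] for values in samples.values()) / len(samples)
            for g in range(n_genes)
        ]
        scaled[donor] = {
            cell: [
                math.log2((values[g] + pseudocount) / (donor_mean[g] + pseudocount))
                for g in range(n_genes)
            ]
            for cell, values in samples.items()
        }
    return scaled


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: 2 cell types, 4 genes, and a gene-specific multiplicative donor effect.
base = {"A": [100.0, 20.0, 5.0, 50.0], "B": [10.0, 80.0, 40.0, 50.0]}
donor_factor = {"d1": [1.0, 1.0, 1.0, 1.0], "d2": [3.0, 0.5, 2.0, 1.5]}
expr = {
    d: {c: [b * f for b, f in zip(base[c], donor_factor[d])] for c in base}
    for d in donor_factor
}

# pseudocount=0.0 is safe here only because every toy value is positive.
scaled = scale_by_donor_mean(expr, pseudocount=0.0)
same_cell = pearson(scaled["d1"]["A"], scaled["d2"]["A"])
diff_cell = pearson(scaled["d1"]["A"], scaled["d1"]["B"])
print("same cell type, different donor:", same_cell)
print("different cell type, same donor:", diff_cell)
```

In this toy case the donor effect is purely multiplicative per gene, so it cancels exactly and the same cell type from different donors ends up with identical scaled profiles. Real data will not cancel this cleanly, and the result can then be fed into any clustering routine.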

My question now is whether this is a legitimate thing to do. I cannot find any literature on a similar case.

Secondly, I would like to know whether I can use the resulting values in a statistical test, such as ANOVA, to find DEGs. I realize that by centering on the mean I have made the samples interdependent per donor, and that I am removing variation in gene expression levels, so it does not feel completely right, but because the clustering performs so well I am tempted to continue. Genes that are always higher in one cell type than in another would also be interesting to me.

Any feedback on possible mistakes I am introducing and/or alternative methods I could use is very much appreciated!

Thank you!

statistics rna-seq anova • 4.0k views
6.4 years ago by
Ming Tang2.6k
Houston/MD Anderson Cancer Center
Ming Tang2.6k wrote:

Use the Bioconductor package sva.

5.7 years ago by
United States
kangyueapril80 wrote:

SVA is more suitable for microarray data. For RNA-seq, you can simply leave the batch difference in place when you normalize. When you test for DEGs, however, you should build your model with both condition and batch as factors, and then look for DEGs on the condition factor.
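To make the "condition plus batch" idea concrete, here is a minimal sketch for a single gene on log-scale expression, using a balanced two-way ANOVA without replication (each donor contributes each cell type exactly once, as in the question). The function name `blocked_anova_f`, the data layout, and the toy numbers are assumptions made for the example; dedicated RNA-seq packages fit count-based models rather than this simple ANOVA.

```python
def blocked_anova_f(values):
    """F statistics for the cell-type effect in one gene.

    values: {donor: {cell_type: log-expression}} (hypothetical layout),
    balanced, with one value per donor and cell type.
    Returns the F statistic with donor as a blocking factor, the naive
    one-way F statistic that ignores donor, and the degrees of freedom.
    """
    donors = sorted(values)
    cells = sorted(values[donors[0]])
    b, k = len(donors), len(cells)

    grand = sum(values[d][c] for d in donors for c in cells) / (b * k)
    cell_mean = {c: sum(values[d][c] for d in donors) / b for c in cells}
    donor_mean = {d: sum(values[d][c] for c in cells) / k for d in donors}

    # Sums of squares for cell type, donor (block), and residual.
    ss_cell = b * sum((cell_mean[c] - grand) ** 2 for c in cells)
    ss_donor = k * sum((donor_mean[d] - grand) ** 2 for d in donors)
    ss_resid = sum(
        (values[d][c] - cell_mean[c] - donor_mean[d] + grand) ** 2
        for d in donors
        for c in cells
    )

    df_cell = k - 1
    df_resid = (k - 1) * (b - 1)   # donor term uses up b - 1 degrees of freedom
    df_resid_naive = k * (b - 1)   # one-way ANOVA that ignores donor

    f_blocked = (ss_cell / df_cell) / (ss_resid / df_resid)
    f_naive = (ss_cell / df_cell) / ((ss_resid + ss_donor) / df_resid_naive)
    return {
        "f_blocked": f_blocked,
        "f_naive": f_naive,
        "df_cell": df_cell,
        "df_resid": df_resid,
    }


# Toy gene: cell type C is higher in every donor, with a strong donor offset.
gene = {
    "d1": {"A": 5.1, "B": 5.4, "C": 7.0, "D": 5.0},
    "d2": {"A": 7.0, "B": 7.6, "C": 9.1, "D": 6.9},
    "d3": {"A": 3.4, "B": 4.1, "C": 5.4, "D": 3.6},
}
result = blocked_anova_f(gene)
print(result)
```

With the donor term in the model, the donor offset no longer inflates the residual variance, so the cell-type effect stands out. Centering each donor on its own mean (on the log scale) removes the same offset, but the residual degrees of freedom must then be (k - 1)(b - 1) rather than k(b - 1); a p-value would come from the F distribution with `df_cell` and `df_resid` degrees of freedom.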
