Question

Dealing with batch effect / donor variation in RNA-seq data

1

9.8 years ago

sbuschow ▴ 10

Hi all,

I am struggling with a statistical question related to RNA-seq data.

I have collected RNA from 4 different cell types taken from the same person, and I have such sets (batches) from 3 individuals. Each batch of 4 cell types was prepared for RNA sequencing separately.

My goals are 1) to identify the relation between these cell types (which ones are most similar and which ones are more different), and 2) to find genes differentially expressed between the cell types.

After receiving the sequencing data I noticed a clear batch/donor effect between the 3 sets of samples.

Standard normalisation procedures were not effective, as the batch effect was different for genes with a high number of reads compared to those with low read counts. Samples clustered according to donor, not cell type.

What does work pretty nicely is to divide the expression values of each sample by the mean expression over all 4 cell types from that donor (i.e. scale out the batch difference per gene), log-transform the scaled values, and use them as input for clustering. After doing this for each donor separately I get a nice clustering according to cell type.
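For readers who want to try the scaling described above, here is a minimal sketch in Python, assuming a genes x samples matrix of normalised expression values. The function name `scale_by_donor_mean`, the pseudocount, and the simulated data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def scale_by_donor_mean(expr, donors, pseudocount=1.0):
    """Divide each gene by its mean over one donor's samples, then take log2.

    expr: array of shape (genes, samples); donors: one donor label per sample.
    The pseudocount (an assumption here) avoids division by zero and log(0).
    """
    expr = np.asarray(expr, dtype=float) + pseudocount
    donors = np.asarray(donors)
    scaled = np.empty_like(expr)
    for donor in np.unique(donors):
        cols = donors == donor
        donor_mean = expr[:, cols].mean(axis=1, keepdims=True)
        scaled[:, cols] = expr[:, cols] / donor_mean
    return np.log2(scaled)

# Simulated toy data (hypothetical): 500 genes, 3 donors x 4 cell types.
rng = np.random.default_rng(0)
n_genes = 500
donors = np.repeat(["d1", "d2", "d3"], 4)
cell_types = np.tile(["A", "B", "C", "D"], 3)
base = rng.lognormal(5.0, 1.0, size=(n_genes, 1))
cell_effect = rng.lognormal(0.0, 0.5, size=(n_genes, 4))
donor_effect = rng.lognormal(0.0, 1.0, size=(n_genes, 3))  # gene-specific donor effect
noise = rng.lognormal(0.0, 0.1, size=(n_genes, 12))
expr = base * np.tile(cell_effect, 3) * np.repeat(donor_effect, 4, axis=1) * noise

def nearest_neighbour(values):
    """Index of the most correlated other sample, for each sample."""
    corr = np.corrcoef(values.T)        # samples x samples correlation
    np.fill_diagonal(corr, -np.inf)     # ignore self-correlation
    return corr.argmax(axis=1)

raw_nn = nearest_neighbour(np.log2(expr + 1.0))
scaled_nn = nearest_neighbour(scale_by_donor_mean(expr, donors))

print("raw: nearest neighbour shares donor    :", np.mean(donors[raw_nn] == donors))
print("scaled: nearest neighbour shares cell type:", np.mean(cell_types[scaled_nn] == cell_types))
```

In this toy setting the donor effect is purely multiplicative per gene, so the division cancels it; real data may not behave this cleanly. Note also that the scaled values within one donor are no longer independent, which matters for any test run on them afterwards.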

My question now is whether this is a legitimate thing to do. I cannot find any literature on a similar case.

Secondly, I would like to know whether I can use the obtained values in a statistical test, such as ANOVA, to find DEGs. I realize I have made the samples interdependent per donor by centering on the mean, and that I am removing variation in gene expression levels, so it does not feel completely right. But because the clustering performs so well I am tempted to continue, also because genes that are always higher in one cell type compared to another would be interesting to me.

Any feedback on possible mistakes I am introducing and/or alternative methods I can use is very much appreciated!

Thank you!

Sonja

statistics RNA-Seq ANOVA • 5.4k views

Ram · Answer 1 · 2014-07-15

3

9.8 years ago

Ming Tommy Tang ★ 3.9k

Use the Bioconductor package sva.

Ram · Answer 2 · 2015-03-17

0

9.1 years ago

kangyueapril ▴ 80

SVA is more suitable for microarray data. For RNA-seq, you can leave the batch difference in place when you normalize. But when you look for DEGs, you should build your model using both condition and batch as factors, and then test for DEGs on the condition factor.
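To make the "condition plus batch as factors" idea concrete, here is a rough sketch in Python of a per-gene F-test for the cell-type term in the model `log_expr ~ donor + cell_type`. It assumes log-scale expression values and plain least squares; it is not what count-based RNA-seq tools do internally, and the function name `blocked_f_statistic` and the toy data are hypothetical.

```python
import numpy as np

def indicator_columns(labels):
    """0/1 columns for every level except the first (treatment coding)."""
    labels = np.asarray(labels)
    levels = np.unique(labels)[1:]
    return (labels[:, None] == levels[None, :]).astype(float)

def residual_ss(design, y):
    """Residual sum of squares per gene after a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return ((y - design @ coef) ** 2).sum(axis=0)

def blocked_f_statistic(log_expr, cell_types, donors):
    """F statistic per gene for cell type, with donor as a blocking factor.

    log_expr: array of shape (genes, samples) on a log scale.
    Returns (F per gene, df of the cell-type term, residual df).
    """
    y = np.asarray(log_expr, dtype=float).T               # samples x genes
    intercept = np.ones((y.shape[0], 1))
    reduced = np.hstack([intercept, indicator_columns(donors)])   # donor only
    cell_cols = indicator_columns(cell_types)
    full = np.hstack([reduced, cell_cols])                        # donor + cell type
    rss_reduced, rss_full = residual_ss(reduced, y), residual_ss(full, y)
    df_term = cell_cols.shape[1]
    df_resid = y.shape[0] - full.shape[1]
    f_stat = ((rss_reduced - rss_full) / df_term) / (rss_full / df_resid)
    return f_stat, df_term, df_resid

# Toy data (hypothetical): gene 0 differs between cell types, gene 1 does not.
rng = np.random.default_rng(1)
donors = np.repeat(["d1", "d2", "d3"], 4)
cell_types = np.tile(["A", "B", "C", "D"], 3)
donor_shift = np.repeat([0.0, 3.0, -3.0], 4)              # strong donor effect
cell_shift = np.tile([0.0, 2.0, 4.0, 6.0], 3)
log_expr = np.vstack([
    donor_shift + cell_shift + rng.normal(0, 0.1, 12),
    donor_shift + rng.normal(0, 0.1, 12),
])

f_stat, df_term, df_resid = blocked_f_statistic(log_expr, cell_types, donors)
print(f_stat, df_term, df_resid)
```

A p-value would follow from the F distribution with `(df_term, df_resid)` degrees of freedom, for example via `scipy.stats.f.sf`. For a balanced design like this one, fitting donor as a factor on the log scale gives the same cell-type contrasts as subtracting each donor's mean log expression per gene, but the residual degrees of freedom correctly account for the donor terms, which a plain ANOVA on pre-centered values would not.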
