Question

RNASeq - How to compare the expression of a single gene across batches

0

Entering edit mode

3.5 years ago

oakhamwolf ▴ 20

Hoping someone can help. Thanks for taking the time to read this post.

I have RNA-Seq data from 7 different experiments (same assay type, just performed on a different day with different samples). I would like to compare the expression of a single gene across these samples. The read data has been aligned and counts obtained (STAR + RSEM).

I've brought in the combined count data, removed genes with low counts (all have at least 10 counts in 3 samples), made a DGEList object within the edgeR framework and ran calcNormFactors().

CPM <- cpm(countdata)
keepthresh <- CPM > 0.5
keep <- rowSums(keepthresh) >= 3
counts.keep <- countdata[keep,]
y <- DGEList(counts.keep)
y <- calcNormFactors(y)

What I want to do now is produce a simple bar plot of the expression level of a single gene across all these samples e.g.

norm_raw <- cpm(y, normalized.lib.sizes = TRUE)
barplot(norm_raw["ENSG00000111640",], names=y$samples$label, las=2, col = y$samples$exp_batch)

The above works fine and gives me the answer I am expecting for my gene of interest (in this case some samples with very low levels of the gene compared to the others).

However, I am very aware that these data come from different batches (confirmed using PCATools) and wondered if this should be taken into account. So I went through the process of batch correction using the following:

log_CPM <- cpm(y, log = TRUE, prior.count = 1, normalized.lib.sizes = TRUE)
log_CPM.batch_corrected <- limma::removeBatchEffect( x = logCPM, batch = y$samples$exp_batch )

When I re-make the bar plot, the expression levels of the gene all level out and are equal across the batches, which is not expected.

As an alternative to batch correction in limma, I also tried internally normalising to a housekeeping gene as follows:

norm_raw <- cpm(y, normalized.lib.sizes = TRUE)
norm_raw_actb <- norm_raw["ENSG00000111640",] / norm_raw["ENSG00000075624",]
barplot(norm_raw_actb, names=y$samples$label, las=2, col = y$samples$exp_batch)

This gives me a barplot that follows a similar expected pattern to the first (non-batch corrected) plot above, but much less pronounced.

My questions are, which is correct? When doing a simple comparison like this, should I batch correct or not? Is an internal house keeping gene comparison enough? Is there a better way to make this sort of comparison?

Many thanks for any advice.

RNA-Seq edgeR batch-effect • 1.5k views

ADD COMMENT • link updated 5 days ago by Ram 43k • written 3.5 years ago by oakhamwolf ▴ 20

score 2 · Accepted Answer · 2020-10-14

2

Entering edit mode

3.5 years ago

i.sudbery 19k

Firstly that the average for each batch is equal is expected - that is explicitly what batch correction does - it makes the average for each batch the same as the average for all other batches. This can be a problem if your batches have different proportions of each experimental treatment in them - so if batch 1 was 33% treatment and 66% control, and batch 2 was 66% treatment and 33% control, then your corrections will be biased. For this reason removeBatchEffect allows you to specify covariates that you don't want to be averaged away. In the worst case, you batches are the same as your experimental conditions. In this case, your design is "perfectly confounded", and there is no way to distinguish which differences are from your experimental conditions and which from batch.

Housekeeping gene correction is not generally considered a particularly good way to adjust RNAseq data, as standard methods (like calcNormFactors) use a far broader range of genes.

However, that doing this reduces the difference between batches for your gene suggests to me that your differences are more likely to be batch related than biological.

ADD COMMENT • link 3.5 years ago by i.sudbery 19k

0

Entering edit mode

Thank you for your reply. It is very much appreciated.

I think I understand what you mean. Just so I have it straight....

RE: Batch correction. When you say "it makes the average for each batch the same as the average for all other batches". Does that "average" refer to the average expression across all genes in each batch or the average for each gene in each batch?. If it's the latter then this explains why my bar plot levels out, as you say.
The data I am using is the control data from 7 experiment batches. Each experiment batch (which I'll be analysing individually) has a number of treatment conditions and an untreated, these control data I'm looking at, are the untreated samples. So in this case, yes you are right, the data is "perfectly confounded" as each control comes from a different batch. So, in the absence of a more ideal transcriptome-wide method, and as this is just a quick look-see comparison for a single gene, would comparing the:

[ raw (i.e. not normalised) CPM of my gene of interest ] / [ raw CPM of Housekeeping gene ]

be a good enough estimate (similar to what is done in the lab), or would I be safer to forget the whole thing?

Thanks again.

ADD REPLY • link 3.5 years ago by oakhamwolf ▴ 20

0

Entering edit mode

For the answer to one - yes its the second option: the average for each gene.

For your second point, it kind of depends on what it is you want to find out with the analysis. Do you expect the different controls to be different from each other?

In general, the assumption you are making with normalising by a housekeeping gene is that the housekeep gene doesn't change between samples. With calcNormFactors, you are assuming that the change between samples averages to 0 across all genes. This is more robust. Consider: you have a gene with 100 reads. Assuming poisson sampling, that would also have a variance of 100 ( or a standard deviation of ten). Very roughly, that means you might expect anywhere between 80 and 120 reads in a sample that had the same expression. But the normalization you get by dividing by 80 is very different to the normalization you'd get by dividing by 120. Thus, if you were going to use housekeeping genes, I suggest using the geometric means of log CPM of a number of housekeeping genes, preferably across the expression level range. I would work in log CPM raher than just CPM and subtract the level of the housekeeping genes.

Generally we worry about batch effects that don't affect every gene the same (e.g. sample 1 has a GC higher GC bias than sample 2, or the fragment length distribution is different meaning that gene length affects count differently between two samples). If we expect the effect to be the same for every gene, then I would expect the standard normalization to work.

As you've only got one "condition" as it were, you can either trust the normalisation and make the leap of faith that your biological signal is bigger than your batch signal (doesn't hurt to see the same pattern with two different normalisation approaches), or if you believe the batch effect might be stronger than the biological effect, accept that this data can't give you the answer you are looking for.

ADD REPLY • link 3.5 years ago by i.sudbery 19k