How to see if adjusting batch effect in RNA-seq is working or not
2
8
Entering edit mode
3.6 years ago
iraia.munoa ▴ 130

Hi everybody! I know that there are a lot of questions about how to adjust the RNA-Seq data if there is a batch effect in the samples, but after reading a lot of posts, questions and recommendations, I still have dobts on how to deal with my samples.

I have RNA-seq data from 2 different samples (control and treatment), which I want to compare and result in DEG list. For each sample, I have 3 replicates.

I have used edgeR to normalize the data (method=TMM), and after that I have performed the MDSplot and PCA plot. In both of them I see that there is a batch effect between the samples collected on the same day, as instead of seeing two separated groups between control and treatment, I see 3 groups with pairs of control_1 and treatment_1, control_2 and treatment_2 and control_3 and treatment_3. I see this in both MDSplot and PCAplot.

So I have been reading in biostars, and decided to perform batch effect correction in edgeR defining "batch" and unsing in in the design matrix for the DE analysis.

But to see if this adjusting of the batch effect works I have used "removeBatchEffect" function to then plot MDS and PCA and see if I obtain that separation. I think that for MDS plot more or less it is correct the separation, but again in PCA plot there is something with control_1 and treatment_1 that doesn't work even after the batch adjustment.

How can I solve this to be sure that when performing the DE analysis I am considering properly the samples and replicates, and finally have a good looking MDS and PCA plots?

Thanks in advance!!

Iraia

batch effect edger removeBatchEffect RNA-Seq • 7.2k views
ADD COMMENT
1
Entering edit mode

Please edit your original post and use these instructions to add images inline: How to add images to a Biostars post

ADD REPLY
0
Entering edit mode

Thanks for showing how to do it!

ADD REPLY
1
Entering edit mode

Strange. Which package is the "PCAplot" function from? Why is there a circle? and how much of the variation does the PCA 1 and 2 explain? I assume the "MDSplot" function is edgeR's own right?

How did you use the "removeBatchEffect"? What input did you give?

ADD REPLY
0
Entering edit mode

Hi again! Thanks for answering. I constructed the PCAplot from a short pipeline I found online in a blog. . Maybe someone can suggest a better way to create the PCA plot, I would be very happy if someone could share it with me.

In the before removing batch effect PCAplot the variation explain with PCA1 and PCA2 are:

Importance of components:
PC1     PC2     PC3     PC4     PC5     PC6
Standard deviation     2.4367 0.15322 0.14532 0.08568 0.07612 0.07057
Proportion of Variance 0.9896 0.00391 0.00352 0.00122 0.00097 0.00083
Cumulative Proportion  0.9896 0.99346 0.99698 0.99820 0.99917 1.00000


In the after removing batch effect PCAplot the variation explain with PCA1 and PCA2 is:

Importance of components:
PC1     PC2     PC3     PC4       PC5       PC6
Standard deviation     2.4457 0.08689 0.07729 0.07106 5.027e-15 3.649e-16
Proportion of Variance 0.9969 0.00126 0.00100 0.00084 0.000e+00 0.000e+00
Cumulative Proportion  0.9969 0.99816 0.99916 1.00000 1.000e+00 1.000e+00


On the other way, yes MDSplot function is from edgeR. And now I am going to write the pipeline used for edgeR normalization and plots generation to see if I am correct in the followed steps, and on the same way answerd your question about "removeBatchEffect" use:

> groupApCTRvsPIR [1] "ApCTR" "ApCTR" "ApCTR" "ApPIR" "ApPIR" "ApPIR"
> d_ApCTRvsPIR <- DGEList(counts=counts_ApCTRvsPIR, group=groupApCTRvsPIR)
> d_ApCTRvsPIR
An object of class "DGEList"
$counts ApCTR_1 ApCTR_2 ApCTR_3 ApPIR_1 ApPIR_2 ApPIR_3 MSTRG.1 47 52 48 45 74 49 MSTRG.2 3401 3893 3630 3721 4535 4409 MSTRG.3 3 2 2 8 4 6 MSTRG.4 2733 3103 2905 2920 3834 3496 MSTRG.5 15 27 9 15 30 16 97456 more rows ...$samples
group lib.size norm.factors
ApCTR_1 ApCTR 58364627            1
ApCTR_2 ApCTR 57248845            1
ApCTR_3 ApCTR 53427701            1
ApPIR_1 ApPIR 55418335            1
ApPIR_2 ApPIR 55278860            1
ApPIR_3 ApPIR 56841444            1

> keepApCTRvsPIR = filterByExpr(d_ApCTRvsPIR)
> summary(keepApCTRvsPIR)
Mode   FALSE    TRUE
logical   76710   20751
> d_ApCTRvsPIR_filt  = d_ApCTRvsPIR[keepApCTRvsPIR, ,keep.lib.sizes = FALSE]
> d_ApCTRvsPIR_filt
> TMM_ApCTRvsPIR <- calcNormFactors(d_ApCTRvsPIR_filt, method="TMM")
> logCPM_TMM <- cpm(TMM_ApCTRvsPIR, log=TRUE)
> plotMDS(logCPM_TMM)
> batch <- factor(substring(colnames(counts_ApCTRvsPIR),7,7))
> batch
[1] 1 2 3 1 2 3
Levels: 1 2 3
> logCPM_Batch <- removeBatchEffect(logCPM_TMM, batch=batch)
> plotMDS(logCPM_Batch)
> loadingsTMM.pca <- data.frame(TMM.pca$rotation, .names = row.names(TMM.pca$rotation))
> levels(loadingsTMM.pca$.names) <- c("CTR_1", "CTR_2", "CTR_3", "PIR_1","PIR_2", "PIR_3") > theta <- seq(0,2*pi,length.out = 100) > circle <- data.frame(x = cos(theta), y = sin(theta)) > p <- ggplot(circle,aes(x,y)) + geom_path() > p + geom_text(data=loadingsTMM.pca, mapping=aes(x = PC1, y = PC2,label = .names, colour = .names)) + coord_fixed(ratio=1) + labs(x = "PC1", y = "PC2") > TMM_batch.pca <- prcomp(logCPM_Batch, center = TRUE, scale. = TRUE) > loadingsTMM_batch.pca <- data.frame(TMM_batch.pca$rotation, .names = row.names(TMM_batch.pca$rotation)) > levels(loadingsTMM_batch.pca$.names) <- c("CTR_1", "CTR_2", "CTR_3", "PIR_1","PIR_2", "PIR_3")
> theta <- seq(0,2*pi,length.out = 100)
> circle <- data.frame(x = cos(theta), y = sin(theta))
> p <- ggplot(circle,aes(x,y)) + geom_path()
> p + geom_text(data=loadingsTMM_batch.pca, mapping=aes(x = PC1, y = PC2,label = .names, colour = .names)) + coord_fixed(ratio=1) + labs(x = "PC1", y = "PC2")


I am going to try to add the images in another way following the guides "genomax" mentions.

If you need more information about my followed steps or something more tell me. Any appreciation, recomendation or help will be welcomed! Thansk,

Iraia

ADD REPLY
1
Entering edit mode

For the PCA bi-plots, you should be plotting the x component, not the rotation component. So, it should be TMM.pca$x, not TMM.pca$rotation. DESeq2's PCA function and my own PCAtools package plot the x component.

To read more:

In both of them I see that there is a batch effect between the samples collected on the same day, as instead of seeing two separated groups between control and treatment, I see 3 groups with pairs of control_1 and treatment_1, control_2 and treatment_2 and control_3 and treatment_3. I see this in both MDSplot and PCAplot.

This is not true. In RNA-seq and expression studies, control and treatment do not necessarily have to segregate on a bi-plot.

Perhaps you should explain more about the laboratory preparatory work that you performed: How was RNA isolation and extraction performed?; which kits were used?; how was sequencing performed and to which depth?

I find that many people worry too much about batch effects. The key to tackling batch is in your experimental design, which you need to describe in detail here.

ADD REPLY
0
Entering edit mode

If the poster didn't transpose his count matrix before doing prcomp, then $rotation might be what he wants. ADD REPLY 2 Entering edit mode Not sure about that. It looks like prcomp() was used. The variable names in prcomp are confusing. rotation is just the component loadings, whereas x is the original data (possibly centered and scaled) multiplied by the loadings, which is what people usually plot. I have very rarely seen people make bi-plots of component loadings. Let's do an example. Create random data-matrix with genes as rows and samples as columns: a <- matrix(rexp(200, rate=.1), ncol=20) rownames(a) <- paste("gene", 1:nrow(a)) colnames(a) <- paste("sample", 1:ncol(a)) a[1:5,1:5] sample 1 sample 2 sample 3 sample 4 sample 5 gene 1 6.264247 6.365399 30.5166354 19.325688 1.19008915 gene 2 4.680086 27.040252 2.6847209 11.610688 0.01055533 gene 3 2.503172 50.479092 11.5079630 4.051146 5.99643389 gene 4 7.043575 14.277681 0.4634761 1.370485 12.13477136 gene 5 0.445609 4.140778 4.1722094 16.223713 2.29143657  Perform PCA: pca <- prcomp(a) pca$rotation[1:5,1:5]
PC1         PC2         PC3         PC4         PC5
sample 1 -0.02305171 -0.01458531 -0.11414442 -0.24311839 -0.24582761
sample 2  0.64868127  0.36214413  0.12525206  0.09180544  0.18693405
sample 3 -0.01409071 -0.15733291  0.42270612 -0.19834829  0.08745401
sample 4 -0.09275092  0.05725445  0.28031322 -0.12768724 -0.32535923
sample 5 -0.06538296 -0.54208160  0.08755804  0.58686901  0.15561673
pca$x[1:5,1:5] PC1 PC2 PC3 PC4 PC5 gene 1 -9.840433 -7.2809657 24.992355 -22.372832 2.221724 gene 2 14.308497 18.2965351 17.224035 -4.579115 -7.820376 gene 3 42.684419 13.0095754 6.842768 8.880879 11.166518 gene 4 14.750230 -10.3337231 -21.131364 5.132869 -13.740045 gene 5 -5.112675 0.3574122 -12.997520 -1.780168 -8.400333  Transpose the input data (t(a)) to see what happens: pca <- prcomp(t(a)) pca$rotation[1:5,1:5]
PC1         PC2        PC3          PC4        PC5
gene 1  0.11446789 -0.16100068  0.4355104 -0.287008355  0.3764465
gene 2 -0.26853155  0.31616181  0.4007856 -0.024583085 -0.2631506
gene 3 -0.75803067  0.19414292  0.1707769  0.338856234  0.2240020
gene 4 -0.35964760 -0.12278897 -0.4664235 -0.050958415 -0.3733642
gene 5 -0.03674479  0.06760442 -0.3148615 -0.004084935 -0.1952294
pca$x[1:5,1:5] PC1 PC2 PC3 PC4 PC5 sample 1 4.558026 -4.294325 -5.099243 -18.381199 0.5192141 sample 2 -38.848512 16.742992 10.597158 4.613802 5.2639276 sample 3 4.678583 -15.539581 18.448164 -5.879506 9.9253448 sample 4 9.035912 -2.562873 12.814073 -8.452763 -6.8241778 sample 5 2.691233 -35.816383 -4.755833 18.870795 -0.5086997  So, iraia.munoa, if your input data has genes as rows and samples as columns, you need to run prcomp like this: pca <- prcomp(t(logCPM_TMM))  Then plot the x variable of pca ADD REPLY 1 Entering edit mode The important thing is that you use the x-component and that you use log transformed data - and I would also suggest that you use scale = TRUE in prcomp. ADD REPLY 0 Entering edit mode so you are agree with Kevin Blighe in doing the transposition? I would try it right know and tell you all about the result. Thanks everybody! ADD REPLY 0 Entering edit mode Hi again! I think I have manage to follow your recomendations. I write them right here to see if I am correct (after TMM normalization, log transformation and removing batch effect): > TMM_ApCTRvsPIR <- calcNormFactors(d_ApCTRvsPIR_filt, method="TMM") > batch <- factor(substring(colnames(counts_ApCTRvsPIR),7,7)) > logCPM_TMM <- cpm(TMM_ApCTRvsPIR, log=TRUE) > logCPM_Batch <- removeBatchEffect(logCPM_TMM, batch=batch)  Defining prcomp with scale=TRUE, and transposing to use the x values: > TMM.pca.t <- prcomp(t(logCPM_TMM), center = TRUE, scale = TRUE) > TMM.pca_Batch.t <- prcomp(t(logCPM_Batch), center = TRUE, scale = TRUE) > loadingsTMM.pca.t <- data.frame(TMM.pca.t$x, .names = row.names(TMM.pca.t$x)) > levels(loadingsTMM.pca.t$.names) <- c("CTR_1", "CTR_2", "CTR_3", "PIR_1","PIR_2", "PIR_3")
> loadingsTMM.pca.t
PC1         PC2       PC3       PC4       PC5          PC6  .names
ApCTR_1 -64.40947  -85.795966  34.09740 -31.98830  48.47095 1.946332e-13 ApCTR_1
ApCTR_2 108.09355    2.567291 -38.10096 -64.73288 -13.73351 2.300715e-13 ApCTR_2
ApCTR_3 -59.12795   72.804475 -78.06010  25.13809  27.58929 3.154052e-13 ApCTR_3
ApPIR_1 -33.02887 -102.248500 -21.47183  34.47791 -49.71457 9.847657e-14 ApPIR_1
ApPIR_2 116.55193   15.136745  43.53845  54.63910  20.24116 2.291754e-13 ApPIR_2
ApPIR_3 -68.07920   97.535956  59.99703 -17.53392 -32.85332 1.743705e-13 ApPIR_3

> loadingsTMM.pca_Batch.t <- data.frame(TMM.pca_Batch.t$x, .names = row.names(TMM.pca_Batch.t$x))
> levels(loadingsTMM.pca_Batch.t\$.names) <- c("CTR_1", "CTR_2", "CTR_3", "PIR_1","PIR_2", "PIR_3")
> loadingsTMM.pca_Batch.t
PC1        PC2       PC3           PC4           PC5           PC6 .names
ApCTR_1  -55.95546   74.72812  80.88401 -2.824193e-12 -7.226637e-13  3.925831e-14  CTR_1
ApCTR_2   74.45098  104.87273 -38.73691  1.402325e-12 -6.142028e-13 -2.228373e-13  CTR_2
ApCTR_3  116.44183  -31.14372  63.63609  1.428725e-12 -6.153763e-13  4.426330e-13  CTR_3
ApPIR_1   55.95546  -74.72812 -80.88401 -2.461710e-12 -5.921279e-13  4.649362e-14  PIR_1
ApPIR_2  -74.45098 -104.87273  38.73691  1.484997e-12 -8.796690e-13 -5.541449e-13  PIR_2
ApPIR_3 -116.44183   31.14372 -63.63609  1.314476e-12 -8.050968e-13  5.544632e-13  PIR_3


Plotting the results:

> theta <- seq(0,2*pi,length.out = 100)
> circle <- data.frame(x = cos(theta), y = sin(theta))
> p <- ggplot(circle,aes(x,y)) + geom_path()
> p + geom_text(data=loadingsTMM.pca.t, mapping=aes(x = PC1, y = PC2,label = .names, colour = .names)) + coord_fixed(ratio=1) + labs(x = "PC1", y = "PC2")


> theta <- seq(0,2*pi,length.out = 100)
> circle <- data.frame(x = cos(theta), y = sin(theta))
> p <- ggplot(circle,aes(x,y)) + geom_path()
> p + geom_text(data=loadingsTMM.pca_Batch.t, mapping=aes(x = PC1, y = PC2,label = .names, colour = .names)) + coord_fixed(ratio=1) + labs(x = "PC1", y = "PC2")


Conclusions:

Is this correct? I mean, I am working with this data for my phd project, would it be correct to use it in this way? or someone recomends me to follow other steps? And something more important, after this steps and with that PCAplots would you follow the proccess and do the DE, believing in the results? I would expect that after removing batch effect, all control samples would be grouped in a corner and all treatment samples in the other corner, I mean with a good separation, and I only see this clasification if I draw a line from the upper left corner to the bottom right corner, and I don't know if it is enough to explain correctly sample distribution. Maybe I have an incorrect understanding of how to understand the PCA plot significance in analyzing RNA-Seq data, so any explanaition would be welcomed!

Thanks in advance

ADD REPLY
1
Entering edit mode

The PCA bi-plot looks as expected now, i.e., using the x value. The batch correction seems to have clearly identified a batch effect and adjusted for it, too. You are still not showing the percent explained variation on the axes on the bi-plots, though (?).

ADD REPLY
0
Entering edit mode

Oh! sorry, with my coments I did not want you to say if everything I have done is correct or not. I only want recomendation of which steps to follow or detection if something I have shown here is correct or not. I mean, we don't have bioinformaticians in our research group, so I am trying to understand and learn everything by myself (as I think a lot of researchers do with sequencing analysis) and trying to do the job in the better possible way. So I only wanted to know if after the batch effect adjust it is a good idea to follow the DE analysis, I mean if after looking at the second PCA plot, experts in the field would say "Well, now the sample clusters are better, and it is worthy to follow with the DE analysis", or on the other hand "Well, the adjust is correct, but maybe there are another steps worthy to follow before DE analysis".

Even so, I will try your idea of doing the same DE analysis, with data before adjustment and after adjustment and compare both DEG lists (and of course the comparisons of all the plots you mentioned above).

Answering your questions, the variation percent is:

> summary(TMM.pca.t)
Importance of components:
PC1     PC2     PC3      PC4      PC5       PC6
Standard deviation     87.9078 81.0742 54.1553 45.44693 38.10458 3.389e-13
Proportion of Variance  0.3724  0.3168  0.1413  0.09953  0.06997 0.000e+00
Cumulative Proportion   0.3724  0.6892  0.8305  0.93003  1.00000 1.000e+00
> summary(TMM.pca_Batch.t)
Importance of components:
PC1     PC2     PC3       PC4       PC5       PC6
Standard deviation     94.3031 83.7914 69.5480 2.139e-12 8.367e-13 4.416e-13
Proportion of Variance  0.4286  0.3383  0.2331 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion   0.4286  0.7669  1.0000 1.000e+00 1.000e+00 1.000e+00


68% and 77% of cumulative variance is a small proportion?

And finally, my goal of using removeBatchEffect() and doing before and after MDS and PCAplots was to see if after adjusting the batch effect, the clusterization of samples was better. But as I have read, I can't use the data after removing the batch effect for DEG analysis right? I need to do the edgeR protocol (or DESeq) but taken into account the "batch" in the matrix design right?

Thanks again

ADD REPLY
1
Entering edit mode

I see what you mean. For your differential expression analysis, just include batch in the design formula and use the un-adjusted normalised counts. The batch effect looks consistent in your study, so, the regression model should be easy to fit by EdgeR's internal engine.

Please take time to read responses here: https://support.bioconductor.org/p/87718/

ADD REPLY
0
Entering edit mode

Thanks again for all your answers, you have helped me a lot! If someone more wants to add some other points of view they are also welcome to writ in this post.

What about the variance values? Are they enough to focud on PC1 vs PC2 or do I need to change to PC3 or something like that?

ADD REPLY
0
Entering edit mode

Regarding the explained variation, nothing seems out of the ordinary to me.

ADD REPLY
0
Entering edit mode

Just regarding these lines:

TMM_ApCTRvsPIR <- calcNormFactors(d_ApCTRvsPIR_filt, method="TMM")
logCPM_TMM <- cpm(TMM_ApCTRvsPIR, log=TRUE)
plotMDS(logCPM_TMM)


This should be:

TMM_ApCTRvsPIR <- calcNormFactors(d_ApCTRvsPIR_filt, method="TMM")
plotMDS(TMM_ApCTRvsPIR)


..or:

TMM_ApCTRvsPIR <- calcNormFactors(d_ApCTRvsPIR_filt, method="TMM")
logCPM_TMM <- cpm(TMM_ApCTRvsPIR, log=TRUE, prior.count = 2)
plotMDS(logCPM_TMM)


Take a look here: https://support.bioconductor.org/p/116614/#116621

ADD REPLY
0
Entering edit mode

I didn't know about "prior.count". I have read what Gordon says in the post you linked, and I have used it in my samples, and the MDSplot looks the same. I don't understand really well why I should use it. It is to avoid that 0 counts become Inf when doing the log transformation? And how do I know which number to chose there? why 2? Thanks,

ADD REPLY
0
Entering edit mode

To follow with RamRS answer, I didn't know the need to transpose the count matrix, and I simply followed a pipeline posted in my first reply. I though that it was correct to use it as a global value for each variables, as they explain the rotations value as coefficients of the linear combinations of the continuous variables.

ADD REPLY
4
Entering edit mode
3.6 years ago
Gordon Smyth ★ 4.8k

It seems that everything is actually working correctly, except that you are not making the PCA plot correctly.

As you already know, to make an MDS plot in edgeR, you can type

plotMDS(logCPM_TMM)


To make an PCA plot, you just use

plotMDS(logCPM_TMM, gene.selection="common")


When you set gene.selection="common", you will see that the axis labels on the plot change from "Leading logFC" to "Principal Component" to indicate that it is a PCA plot.

I'm not sure why you would need a PCA plot, since it is unlikely to add anything over the MDS plot but, if you want one, it's as easy as that.

ADD COMMENT
1
Entering edit mode

Thanks Gordon! I didn't know it was as easy as that! I also will have into account the term prior.count when doing logcpm values. I have seen your explanation in support.bioconductot, so everything is clear now. I am really happy that you guys concers about helping beginers with your knowledge, Thanks again!

Pd: is it posible to add to that pca plot in edgeR an ellipse grouping control and treatment samples?

ADD REPLY
0
Entering edit mode

And is it possible to add variance % explained por PC1 and PC2 in the axes?

ADD REPLY
0
Entering edit mode

Can you not search for how to do this?

ADD REPLY
2
Entering edit mode
3.6 years ago

I think there are two different definitions of "working":

1) Checking if there is a bug in the code - The code doesn't work as described (in terms of carrying out the individual steps that it is supposed to, without crashing or having unexpected errors like "NA" values). This can be identified and fixed, and I think that is the focus of the responses above (in other words, if you know a package as a function, check that you are using the correct commands for the function that you want to use.

2) Assessment of the normalization/adjustment (concerns about over-fitting, etc.) - I think this is harder to define in a way that applies equally well to all projects. However, I think checking expression before and after the extra step may be important.

This can be a little tricky in situations where you include co-variates that affect clustering. In other words, if the concern is that the normalization makes pre-defined groups artificially cluster together more than can be robustly observed in other studies (which I would consider "over-fitting" and/or "over-correction"), then the overall clustering in a PCA plot may be difficult to assess.

If I had to give some possibly useful recommendations (that may or may not help for a particular project), this is what I can think of:

a) Visually check clustering of replicates in the heatmap. I think this matters more for the p-value calculation (checking that good concordance among replicates was indeed defined), but I admittedly use multi-variate p-value models (tested with different methods for each project) more often than a different normalization (or, just for the heatmap, I might do something like center by batch/tissue/cell prior to setting the overall mean to 0 and standard deviation of 1).

b) Visually inspect some individual candidate genes (possibly positive controls from your experiment). For example, if you are correcting an effect, make sure that the groups are randomized and you can observe the same trend if you separate groups by the variable you are trying to correct. For example, if you have 3 batches and 2 groups, I would check i) is one sample from each group in each batch and ii) assuming your candidate is up-regulated for each treatment, can you confirm that expression is increased for each treatment relative to each control for each separate batch? The adjustment may make this easier to see, but I think you still need the original expression (possibly in a supplemental figure) do confirm you haven't done something weird (like change the relative trend in expression within batches).

If there are other ideas of how to address #2, I would love to hear them!

ADD COMMENT

Login before adding your answer.

Traffic: 1826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6