Question: Design in DESeq: Can you combine explicit and implicit batch effect correction in (SVA+Metadata Variables)?
1
gravatar for Kristin Muench
12 months ago by
United States
Kristin Muench450 wrote:

Hello,

I've heard of two different ways of doing batch effect correction:

  1. Explicit:

    design(myData_collapsed) <- ~ Sex + Genotype
    dds <- DESeq(myData_collapsed)
    deseqOutput <- results(dds, alpha = 0.05, contrasts = c('Genotype', 'A', 'B'))
    
  2. Implicit:

    # make a full model matrix
    mod  <- model.matrix(~ Genotype, colData(myData_collapsed))
    # make a null model to compare it to
    mod0 <- model.matrix(~   1, colData(myData_collapsed))
    
    # calculate SVs
    svseq <- svaseq( assay(myData_collapsed), mod, mod0, n.sv = 4)
    myData_collapsed$SV1 <- svseq$sv[,1]
    myData_collapsed$SV2 <- svseq$sv[,2]
    myData_collapsed$SV3 <- svseq$sv[,3]
    myData_collapsed$SV4 <- svseq$sv[,4]
    
    # redesign matrix
    design(myData_collapsed) <- ~ SV1 + SV2 + SV3 + SV4 + Genotype
    
     # DESeq
    dds <- DESeq(myData_collapsed)
    deseqOutput <- results(dds, alpha = 0.05, contrasts = c('Genotype', 'A', 'B'))
    

I only ever see people doing one of the two strategies above. However, it seems to me that it should be possible to do both, just as long as you make sure to include the explicit batches in the design of your model matrix when calculating SVs, like so:

# make a full model matrix
mod  <- model.matrix(~ Sex + Genotype, colData(myData_collapsed))

# make a null model to compare it to
mod0 <- model.matrix(~   1, colData(myData_collapsed))

# make SVs
svseq <- svaseq( assay(myData_collapsed), mod, mod0, n.sv = 4)

# new DESeq matrix design
myData_collapsed$SV1 <- svseq$sv[,1]
myData_collapsed$SV2 <- svseq$sv[,2]
myData_collapsed$SV3 <- svseq$sv[,3]
myData_collapsed$SV4 <- svseq$sv[,4]
design(myData_collapsed) <- ~ SV1 + SV2 + SV3 + SV4 + Sex + Genotype

# perform DESeq
dds <- DESeq(myData_collapsed)
deseqOutput <- results(dds, alpha = 0.05, contrasts = c('Genotype', 'A', 'B'))

Is that correct?

(Or - just realized this - should the null model include the explicit batch effect, e.g. mod0 <- model.matrix(~ Sex, colData(myData_collapsed)) ?)

Thank you for your help! I appreciate all the help I've been getting from BioStars recently, and I hope I can do a good job at documentation for others Googling this issue.

batch effects deseq2 sva R • 1.1k views
ADD COMMENTlink modified 11 months ago by Teresa20 • written 12 months ago by Kristin Muench450
2

I always go for the explicit when I have the confounding factors measured. Explicit model means you know the batch variables and where they are most likely from and solving the model will tell you how much they are each contributing. In hidden models, you just get a bunch of blobs which you need to dig out what they mean exactly by correlating with known factors.

ADD REPLYlink written 12 months ago by btsui290

I also have problems in detecting if the model I have is good as it is ~ sex + condition or if there is any batch in the data. What it makes me feel a bit confused is the % variance I am obtaining in the PCA plot. Could you help me?

Thanks

Both clusters are separated by gender but not condition but I cant image another variable to add in the design as I only want to compare between case/control population in an homogeneous population

ADD REPLYlink written 11 months ago by Teresa20

Ciao teresa. I moved your question to a comment. In your PCA bi-plot, the percent variation on PC1 is relatively low (relative to the amount of variation expected with a batch effect, which could be up to 80%); however, the form of the 2 'strata' of samples looks too consistent for it to be based on just natural biological variation. You could try to colour the samples based on other metadata that you have in an attempt to identify a logical reason behind the finding.

ADD REPLYlink written 11 months ago by Kevin Blighe46k

Those two groups are separated by gender. When I colour the samples based on it the samples cluster perfectly into both groups: male and female. That's why I have used gender in the design. But I dont know if that is sufficient for the design. I have a question: adding another variable the % assigned would increase? In case that the design is ok with those two variables, would it be a problem presenting that PCA with so low variance in the results?

Thanks a lot Kevin.

ADD REPLYlink written 11 months ago by Teresa20

This is the resulting plot

ADD REPLYlink modified 11 months ago • written 11 months ago by Teresa20

Oh, I see. Have you used plotPCA() from the DESeq2 package, and on the regularised log transformed counts?

Can you retry the PCA but set blind=FALSE when doing the transformation, that is, for example:

rld <- rlog(dds, blind=FALSE)

plotPCA(rld, intgroup=c("sex", "condition"))
ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe46k

I show you the code that I am using in R:

library(DESeq2)

DESeq.ds<-DESeqDataSetFromMatrix(readcounts, colData=coldata, design=~ sex + condition)

DESeq.ds<-estimateSizeFactors(DESeq.ds)

sizeFactors(DESeq.ds)

vsd <- varianceStabilizingTransformation(DESeq.ds, blind = FALSE)

plotPCA(vsd, intgroup=c("condition","sex"))

Would this be right?

ADD REPLYlink written 11 months ago by Teresa20

I am not using rlog because I have 110 samples and it spends a lot of time so I turned to use vst that I have read it is faster.

ADD REPLYlink written 11 months ago by Teresa20

Yes, thew above code would be correct for when using varianceStabilizingTransformation. It produced the same plot?

You just need to do:

library(DESeq2)

DESeq.ds<-DESeqDataSetFromMatrix(readcounts, colData=coldata, design = ~ sex + condition)

DESeq.ds <- DESeq(DESeq.ds)

vsd <- varianceStabilizingTransformation(DESeq.ds, blind = FALSE)

plotPCA(vsd, intgroup=c("condition","sex"))
ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe46k

The above plot was done with that code. I have always plotted the PCA with this code. So, do you think that this design is valid or do I need to add more variables?

ADD REPLYlink written 11 months ago by Teresa20
1

In talking with a colleague (Devon), this type of segregation based on gender would be observed when the condition of interest itself has a weak effect. So, based on this, the assumption is that you may not see many statistically significant genes when you go forward to do differential expression analysis. It is still a good idea, generally, to include sex in the design formula though.

Conversely, if your condition of interest is known to have different pathology / pathobiology between male and female, then that is an explanation too.

There may also be other, unknown biases in the data; however, the % variation on PC1 is actually quite small, at 19%.

So, proceed, but keep in mind this finding.

ADD REPLYlink written 11 months ago by Kevin Blighe46k

Dear kevin,

Sorry for the delay in my response. Thanks for the explanation. I have proceeded with this design and I have obtained 226 DE genes. Is this number too high as we should expect? Thanks!

ADD REPLYlink written 11 months ago by Teresa20

It seems okay. What is the range of P values? Any genes that support your hypothesis(es)?

ADD REPLYlink written 11 months ago by Kevin Blighe46k

Dear Kevin,

Some genes supported the hypothesis as we expected. Nevertheless, I would like to ask you if it would be a good idea separating the dataset by gender. Firstly, I would study only males from both conditions using the design ~condition. The same for females. I dont know if this is posible. The PCA I have obtained from studying only males from both conditions is this one: enter image description here

I cant see the two groups, what do you advice me to do?

ADD REPLYlink modified 11 months ago • written 11 months ago by Teresa20

If you already found genes that are in line with your hypothesis, then that's good. If you divide the dataset into male and female and process them independently, then [hopefully], you would see the same results.

I note that the PC1 % variation in your 'male' plot (just above) is greater than the PC1 % variation in your 'male + female' plot. I think that the conclusion is that gender is not actually a major influence in your dataset.

ADD REPLYlink written 11 months ago by Kevin Blighe46k

Hi Kevin,

I have continued with the pipeline in the male dataset only using condition in the design and I have obtained about 1000 DE genes in the male dataset.

There is a lot of difference compared to the DE genes that I obtained when using the male and female dataset joint (about 100 DE genes). So I think that separating the dataset by gender is not a good idea but I am not able to know the reason.

How can I deal with gender in my analysis?

Thanks a lot

ADD REPLYlink modified 10 months ago • written 10 months ago by Teresa20

Is there any reason why gender would have an influence on the disease that you are studying? Is there any possibility that male and female samples were collected and / or processed differently in your study?

The best way is to keep it (gender) is as a covariate in your design formula. When you did that, the % variation along PC1 was 19%, which is not extra-ordinary, but is not miniscule, either.

ADD REPLYlink written 10 months ago by Kevin Blighe46k

I have plotted the heatmap using the dataset where I keep gender as a covariate and the top DE genes are very highlighted...

One of these genes is PRKY which I have read in the literature:

These genes are worth noting because they are all non-pseudoautosomal X chromosome genes with Y-linked copies (NLGN4Y, PRKY and TMSB4Y) that are expressed in human brain

As I am studying a neurodegenerative disorder, could it make sense?

enter image description here

ADD REPLYlink modified 10 months ago • written 10 months ago by Teresa20

I see that you have been posting here, too: https://support.bioconductor.org/p/79549/ ?

Michael Love is one of the developers of DESeq2 and I can see that he is more or less corroborating what I am saying here.

Gender differences in neurodegenerative diseases are not well understood, but 'tentative' evidence exists in this regard.

Your heatmap looks strange - only a few genes have very high or low values. Are you plotting the unlogged counts? Your heatmap should be produced from rlog or vst counts that are produced with blind=FALSE.

In the other thread, you also have a batch variable - why is that?

Here is some clean and simple code. Can you try this and see what you get?

dds <- DESeqDataSetFromTximport(txi.working, colData=metadata, design= ~ GENDER + Condition)
dds <- DESeq(dds)

vst <- vst(dds, blind=FALSE)
rld <- rlog(dds, blind=FALSE)


res <- results(dds, contrast=c("Condition", "ConditionA", "ConditionB"), independentFiltering=TRUE, alpha=0.05, pAdjustMethod="BH", parallel=FALSE)
res <- lfcShrink(dds, contrast=c("Condition", "ConditionA", "ConditionB"), res=res)
ADD REPLYlink written 10 months ago by Kevin Blighe46k

In the other thread I used the batch variable because I thought I had a batch in my dataset but I have realized that I do not have any batch.

I have used another script now to generate the heatmap as you can see here:

    >     vsd <- varianceStabilizingTransformation(ddsHTSeq, blind=FALSE) 
    >     DESeq.ds <- DESeq(ddsHTSeq) 
    >     res<-results(DESeq.ds) 
    >     resOrdered <- res[order(res$padj),]  
    >     head(resOrdered) 
    >     select_genes<-rownames(subset(resOrdered, padj < 0.05)) 
    >     heatmap.2( assay(vsd)[ select_genes, ], scale="row", 
    >                    trace="none", dendrogram="column", 
    >                    col = colorRampPalette( rev(brewer.pal(9, "RdBu")) )(255),
    >                    ColSideColors = c( Control="gray", DPN="darkgreen", OHT="orange" )[
    >                      colData(vsd)$condition ] )

enter image description here

I cant see any type of clustering of my samples by the condition of interest..

ADD REPLYlink modified 10 months ago • written 10 months ago by Teresa20

It does not appear to be a clear division between your condition of interest; however, there is definitely a higher concentration of green in the cluster on the left in the dendrogram. That may be statistically significant.

You could take those statistically significant genes and reduce them further via, for example, stepwise regression or regularised regression (e.g. lasso-penalsied regression), and then perform ROC analysis.

Were you expecting a clear division between your conditions? It may simply not exist for your disease of interest.

ADD REPLYlink written 10 months ago by Kevin Blighe46k
2
gravatar for Kevin Blighe
12 months ago by
Kevin Blighe46k
Kevin Blighe46k wrote:

The starting point to any RNA-seq experiment should be a well-designed study that negates the need for any form of batch adjustment / correction. This is often not the case, as we know.

In the 'explicit' part, you are not actually doing any adjustment for batch. You've just got gender (sex) and genotype in the design model - are you implying that gender is your 'batch' effect for which you are aiming to adjust? Gender can be a confounding factor in mixed-gender / transgender RNA-seq studies, in particular for genes transcribed from chromosome X and / or those that have escaped chrX inactivation.

In your 'implicit' part, with your code, you are looking for any residual batch (or other) effects in your data without knowing a priori what they may be. This type of adjustment should not be performed without good reason. Performing it without justification could lead to unnecessary adjustment of your coefficients and the obtaining of unrealistic P values during differential expression. It could be performed in a situation like this: our laboratory receives samples of minimal information and we are told to process them. We realise, after an initial test, that there are likely unknown batch effects, so, we employ SVA to model these effects and adjust for them (as you do).

-----------------------------------------------

batch effect known

When we know what the batch effect is, we should include it in the design formula at the left-most part, e.g.:

design = ~ batch + sex + genotype

This has the effect of modelling the known batch effect and making adjustments for it when it comes to differential expression analysis. No further adjustments from this are necessary unless you wish to use your normalised / transformed counts for other tools, downstream. In such a situation, there are 2 options:

  • use removeBatchEffect() (limma) on the transformed counts to subtract out the batch effect
  • set blind=FALSE when transforming by regularised log or variance stabilisation (DESeq2 only), which has the effect of adjusting the counts based on the modeled batch effect

batch effect(s) unknown

If the batch effect(s) is/are unknown, we can use SVA, as you have done, in which case we arrive at:

design = ~ SV1 + SV2 + Sex + Genotype

NB - include as minimal a number of surrogate variables int he model as possible; otherwise, you'll over-adjust

You can then later subtract out the modelled batch effect as above.


Yet other methods aim to directly adjust raw or normalised counts for either of the situations mentioned above. I go over this here in this previous answer: A: Batch effects : ComBat or removebatcheffects (limma package) ?

Kevin

ADD COMMENTlink modified 12 months ago • written 12 months ago by Kevin Blighe46k

Thank you so much for your detailed answer! I'll address different parts of your post in unique comments to keep the threads separate.

Regarding Sex effects - good point. I referred to it colloquially as a "batch effect" but really what I mean was "factor I think we need to control for when calculating differentially expressed genes." Our study has both male and female samples in both Genotypes, but the Ns aren't even - one Genotype condition has more females. Based on existing literature, I don't think this genotypic aberration should affect sex-linked gene transcription at all. However, if I only do my study with the study design "~Genotype", the differential expression gene list has many X-linked genes - presumably due to the difference in the number of females in one group. Furthermore, when I cluster the rlog-normalized expression data, the samples cluster much better by Sex and less well by Genotype. This leads me to believe that the Sex effect is something that we'd need to 'correct' for in order to get to the real signal, which is how Genotype affects transcription. That's my rationale for including Sex as a "known batch effect."

ADD REPLYlink written 12 months ago by Kristin Muench450

Regarding when to use SVA - thanks, helpful to have an example for how it is deployed. I think I'm curious about the cases that fall into both categories. For example, in our data, I suspect there is a batch effect due to the cell thaw date, or that Sex is a factor to correct for (see other comment). However, even when I include those two factors in my study design, and observe how my data cluster in PCA after using the limma package's observeBatchEffects() function, my data do not cluster as cleanly as I would predict given my knowledge of the effect of this particular Genotype. Therefore, I thought there might be some other batch effect I might need to account for (e.g., I don't know, how long the sequencing core kept the library on the bench during preparation or something).

Once again - thank you so very much - starting a thorough read of the linked post.

ADD REPLYlink modified 12 months ago • written 12 months ago by Kristin Muench450

Oh right, yeh, I see what you mean. There can be both known and unknown effects in the data. The known you can model in the design formula; the unknown can be estimated via things like SVA. The logic there is fine (to use both) but from experience I know that it's generally not a great idea to have a long design formula, i.e., it should be kept as short as possible so as to not 'over-adjust'. One then just relies on (or hopes!) that the experimental design is sufficient enough to control for everything.

I know that, from using SVA, the program returns some sort of estimate of how important is each identified surrogate variable. So, could you get away with just including 1 surrogate variable? Also, the justification for assuming that something like thaw date is having an influence would usually come from a PCA bi-plot - did you generate those and see samples segregating on thaw date?

One thing that I do in RNA-seq analyses is generate PCA bi-plots and colour the samples based on all variables given to me, including things like race, center, RIN, disease state, etc.

ADD REPLYlink written 12 months ago by Kevin Blighe46k

I'll see how it looks only using one SV - definitely don't want to overcorrect.

Previously, I've only plotted the first two PCs and colored by a factor (just as you said - did this by all factors given to me). I did see some clustering by thaw date as well as sex, though no clustering for the other variables. As a side note, if I perform SVA without including Sex/Thaw Date into my design matrix (example 2 above), I see that a couple of the SVs correlate pretty well with Sex and Thaw Date - so in example #2 it seems like some of that variance is being detected through SVs, even though I'm not explicitly including it in my design. That said - it's not like there is a single SV that perfectly corresponds with one of those factors - and I can't say whether or not having an SV account for a known batch effect is as effective as including it in the design.

I've never made a biplot before! I'll give it a try. In my first test just now, I see what looks like my usual PCA plot (where points = samples, colored by factor of interest), with a circle of arrows emanating from the center with each arrow representing a gene. As I understand it, the length of the arrow indicates the relative contribution of that gene to the plotted PCs, with short arrows having negligible effects on those PCs. Two follow up questions: 1. In the case of sequencing data, do you usually filter for some set of genes to plot fewer arrows/make the effects a little easier to see (since otherwise there is just a flurry of thousands of arrows)? 2. If a batch effect was present, in addition to sample clustering, would I expect to see a subset of genes pointing in the same direction?

ADD REPLYlink written 12 months ago by Kristin Muench450

Oh, which code were you using to generate the PCA bi-plots? DESeq2 has it's own built-in function, plotPCA(), and there's some of 'my' code here: A: PCA plot from read count matrix from RNA-Seq

Another critical thing to note is the amount of variance denoted on the axes.

For example, in a situation where there is a strong batch effect in a dataset, PC1 will separate the batches and it will account for >40% variation in the dataset. Less subtle effects may be encountered on higher (less important) PCs, and may only account for a negligible amount of variation. In such cases, there'd be more damage done in trying to adjust for these than not.

Of course, without seeing your data in front of me, it's difficult to give a clear yes / no in relation to how best to analyse it. These are usually judgement calls that are made independently for each study.

ADD REPLYlink written 12 months ago by Kevin Blighe46k

Ah I see!

I was using ggplot for my regular PCAs, and ggbiplots for my biplot:

library(ggbiplot)
g <- ggbiplot(cds.pca.rld_noBatchCorrect, obs.scale = 1, var.scale = 1, 
              groups = sampleTable$Genotype, ellipse = TRUE, 
              circle = TRUE,
              varname.size=0.2) # to functionally remove gene names from cluttering screen
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top') 
print(g)

Using Plot PCA: PC1 accounts for 39% of the variance. PC2 accounts for 25% of the variance.

Color by Genotype: PCA by Genotype

Color by Sex: PCA by Sex

Color by Thaw Batch: enter image description here

I'd interpret these as there being a medium-high effect of Sex, and maybe a low-medium effect of GrowBatch, that might muddy clean clustering in Genotype. I would definitely consider including Sex in my results, but maybe not GrowBatch. Is that correct?

(Also, as a side note - the variance accounted for by each PC is different whether I'm using prcomp(), ggbiplot(), or plotPCA() - eek - I'll need to look into why that is. Genotype clusters much less well using prcomp() or ggbiplot() than it does here, and much less variance is explained. Pic below for ggbiplot() )

Genotype clustering using ggbiplot

ADD REPLYlink written 12 months ago by Kristin Muench450

Yes, looking at that, I would include sex in the design formula. Male:Female are almost exclusively segregated along the diagonal. Looking at Thaw Batch, it looks difficult including that in the design formula. Encouragingly, at least 1 sample from each Thaw Batch overlaps with those of another.

The reason for obtaining different variances by the different programs is because DESeq2's plotPCA() function's default is to filter out a large chunk of genes based on low variance before performing PCA. In this sense, it is biased / supervised PCA.

Other than everything else that I've written, your dataset looks good to go with ~ sex + genotype

ADD REPLYlink written 12 months ago by Kevin Blighe46k

Thank you very much!!

ADD REPLYlink written 12 months ago by Kristin Muench450

I have put the question below. Thanks in advanced !

ADD REPLYlink modified 11 months ago • written 11 months ago by Teresa20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1326 users visited in the last hour