Question: Does PCA results impact on DESeq2 design choices?
gravatar for sc
8 weeks ago by
sc20 wrote:

I've been looking into a SARS-CoV-2 dataset, GSE152075. To put simply, they took RNA from infection positive and infection negative people. My aim is to look at the gene expression of a dozen genes I'm interested in to see if it is different based on infection status.

To find the DEG between positive vs negative using DESeq2, the authors used the design, sequencing_batch + infection_status. See summary below for batch numbers and infection status splits.

Batches; A: 19, B: 15, C: 17, D: 14, E: 23, F: 34, G: 16, H: 21, I: 15, J: 26, K: 10, L: 43, M: 19, N: 13, O: 28, P: 54, Q: 57, R: 14, S: 20, T: 3, U: 16

Infection status; negative = 54, positive = 413

Using these parameters, the vst normalized PCA looks like this (below): PCA-plot

I don't particularly see any significant clustering of the groups except for blue/purple being somewhat separated from pink on PC1 and brown colors (negative infection) from the main group (positive infection) on PC2. Is the correct interpretation that that the batch variable is very weakly linked to variance in the dataset?

Intuitively, to me having 21 batches that are uneven is probably going to make finding any differences significantly harder. However, it is also intuitive that batch effects should be considered. If my interpretation is true, is this a justification for me to exclude sequencing batch from the design?

Or should I be using the LRT function to check the difference in adj. pvals between a design of 'sequence_batch + infection_status' vs 'infection_status' for my genes of interest and deciding based on that?


rna-seq pca deseq2 • 235 views
ADD COMMENTlink modified 8 weeks ago by caggtaagtat1.3k • written 8 weeks ago by sc20
gravatar for caggtaagtat
8 weeks ago by
caggtaagtat1.3k wrote:


looking at the description of the data on GEO, it mentions that they also saw differences in gene expression depending on age and sex of the individuals, as well as infection time course (they say they included it as interaction terms). It makes sense that gene expression could vary between an infected person, who shows very high loads of viral transcripts compared to an infected person who shows a very low viral load in the samples. I would think about including sex, age group and infection course (which could be approximated by viral load) in the DESeq2 design, maybe this leads to a slightly better clustering. Though, clustering of expression data of different individuals will of course show quite a variance just depending on the individual still. That probably also explains, why they chose an treshold for the adjusted p-value of 0.1.

The viral load can be estimated by the amount of reads mapped to SARS-Cov2 transcripts and maybe it can be categorized into high vs low aboundance of viral transcripts. Removing the batch from the design would be not a good idea, I think, because they not only used multiple flow cells for sequencing this bunch of data, but they also used different sequencing mashines (producing 70nt and 100nt long reads).

In my opinion, adjusting the design by checking if genes of interest show the expected outcome would be bad practice and should not be done, since that way, there might be a chance you choose the design wrong to get as many positives as possible, which, however, could be positive for the wrong reasons and you might have to explain this to the reviewer later.

If I didn't miss it, the paper doesn't mention, that they mapped the reads also to the SARS-Cov2 genome, which I think would be beneficial to measure viral load, since not including it could theoretically lead to misalignments of reads which come from viral transcripts but get mapped on the human genome.

ADD COMMENTlink written 8 weeks ago by caggtaagtat1.3k

Thanks for taking the time to answer. This raises several new questions but I don't want to deviate too much from the original question so I'll break this up into two sections:

Original topic:

As far as I can tell, in the series matrix they provided, it appears that all samples were sequenced on the 75bp machine even though as you said, in their methods they wrote they used two different ones. Indeed, age and sex likely factor into the differences in gene expression but to stay on topic, let's say as a hypothetical, there are only batch and infection status considered here.

My question is more in the interpretation of the PCA plot.

1) If you saw such a result, would you interpret it as there is not a very clear batch effect? Or put another way, the primary source of variation on PC1 is not batch related. As for PC2, the only somewhat clear separation can only been seen in batches that are exclusively associated with negative infected samples, thus any batch-related effects are actually infection status related effects.

2) And if indeed, there is no clear batch effect, does this mean that there is really not much benefit to adding it to the design model? Particularly given that negative infected samples are exclusively in only a subset of batches without any corresponding positive infected samples? Should batch ALWAYS be corrected for regardless of whether or not there is batch clustering?

3) As for checking adj. pvals I was referring to the comparison between full and reduced models. From the DESeq2 vignette: "The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero." So if a full model design of: ~batch + infection_status vs a reduced model design of: ~infection_status produces p = 1e-10 for Gene X, I think that it is indicating that Gene X is different by infection status accounting for batch. Thus, batch appears to be important and the full model should be used.

Extended topic:

As a clarification, for infection time course, we can ignore that as it refers to some other data they have. For GSE152075, they do not have this data. They do have viral load as approximated by qPCR CT values of the N1 protein from the virus, however, I wouldn't necessarily say a higher viral load = further along in infection as for example, living with an infected roommate would be different to being infected by a random passer-by.

They did indeed look into age and sex DEG as well but they did it in 3 separate analyses. That is to say, they used a design of:

For infected vs non-infected DEG

Design 1: batch + infection status

For older vs younger DEG

Design 2: age(categorical 60+ or less) + infection_status + age:infection status

For male vs female DEG

Design 3: infection status + sex + infection_status:sex

I think I agree with you that even in the infected vs non-infected DEG analysis that the ideal initial design should probably have been something like: batch + age + sex + viral_load + infection status + age:infection status

These are the factors that intuitively should affect gene expression.

However, with my perhaps naive logic, I think there is also justification to check first if there is a difference in age or gender between infected and non-infected populations. If there is not much difference, a simpler design model such as design 1 would be sufficient. Downstream of DESeq2, plotting of the transformed counts by viral load grouping would then give you an idea of if any of the DEG genes are changed by viral load.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by sc20

1) If you saw such a result, would you interpret it as there is not a very clear batch effect?

All batches (except batch C) fail to include both positive and negative samples. So it is not really possible to distinguish batch effect from virus effect anyway.

ADD REPLYlink written 8 weeks ago by Carlo Yague5.2k

Thanks Carlo, I thought this was the case too.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by sc20

Your welcome, always happy to discuss :)

1) From this plot, I personally would think, that PC1 is indeed batch related, because positives from batch pink are far right and positives from batch blue seem to concentrate to the left (on a PC1 axis that describes 37% of the variance which is not low). I agree that PC2 (11%) seems to separate more between positives and negatives.

2) Let me cite the developer Mr Love on this from a question on the Bioconductor support. “Putting batch in the design formula is our recommendation when batches are known and when they are not confounded with the condition.”. So I would take this as, yes always include the batch effect in the design, unless it gives the same sample separation as the condition (so all infected are batch1 and all uninfected are batch2). Then there is no additional benefit.

3) Ok, I never worked with the LRT, but I think you are right with what the test means. I do not know at what number of genes with adj. pvalue lower e.g. 0.05 you would say, you have to include e.g. the factor sex in your design. I would not use it on the batch effect though, since this is a technical factor, because you might have different primer batches etc.


I wouldn't necessarily say a higher viral load = further along in infection.

Yes you are right, I was mistaken. Viral load is probably also has a greater impact on gene expression than time course of the infection (at least that’s what we saw with HIV in primary cells, where the viral load upon infection did not increase uniformly in all samples, but was phased).

Concering their setting of factors in the design for the different analysis, I would have done it like you said with all factors included and just other interaction terms at the end. That way you account for every variance inducing factor in every comparison, which would be more correct I think. They also only include the batch effect in one of their comparisons.

I think there is also justification to check first if there is a difference in age or gender between infected and non-infected populations

Yeah, if I understood correctly, you can test with the LRT if you should include sex or viral load in the design for the Wald test. But again, I never worked with it, so better read further into that.

ADD REPLYlink written 8 weeks ago by caggtaagtat1.3k

Thanks caggtaagtat, I think to summarize:

Original topic

1) Should we include batch in the design based on this PCA?

No. Well in fact, the PCA doesn't even matter in this case because the negative infected samples are almost all in their own separate batches compared to positives. Thus, you can't delineate batch effects from virus infection effects.

2) But what about the separation we see between blue/purple batches and pinky colors on PC1?

There is batch-related variation on PC1 but is it significant enough to warrant its inclusion at the great cost to power? In my opinion, most are centrally clustered at 0. Blue is more left clustered and only a few pink are right clustered. Perhaps a good compromise would be to take only the samples clustered around -25 to 25 and exclude batch from design.

3) Should we base our decision on LRT comparisons of full and reduced models?

Based on the answer to (1), in this case no.

Extended topic

I think there is also justification to check first if there is a difference in age or gender between infected and non-infected populations

I actually meant for example just a simple boxplot for y = age and x = infection status and doing a t-test. If no difference, it suggests age does not impact on infection status and thus is not necessary to add to design?

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by sc20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2307 users visited in the last hour