Question

[DESeq2] Can I setup a formula that includes a common time zero and also a two-factor analysis?

0

Entering edit mode

2.9 years ago

chaco001 ▴ 40

Hello,

I have data from a Tn-Seq experiment. We have a barcoded knockout library. We measured the abundance of each knockout at time zero (by taking an aliquot from the common library), then did a 2x2 factorial experiment with 5 replicates per level.

I understand how to use (the excellent) DESeq2 to analyze the factorial structure of my data, by having a two-factor colData data.frame and using a formula with an explicit interaction (e.g. ~ A + B + A:B). This is very useful for telling me how the treatments differ in their effect on relative abundance.

However, it does not incorporate the T0 data. If I could incorporate that, it would also give me information on absolute fitness, which would be very useful for my study.

I have an alternate in-house pipeline that first calculates fitness by finding log2 changes from T0 for each replicate, then just does a two-factor lm on each gene's fitness with subsequent BH corrections, but I think this in-house pipeline doesn't do a great job normalizing gene counts and is giving spurious results.

Is there a way I can do this in DESeq2? I apologize for any lack of clarity. Thanks!

DESeq2 two-way two-factor fitness • 810 views

ADD COMMENT • link updated 2.9 years ago by Carlo Yague 8.6k • written 2.9 years ago by chaco001 ▴ 40

1

Entering edit mode

Hi Chaco001, could you please clarify the following points so that we can provide accurate answers ?

You wrote that you "measure abundance". Abundance of what exactly ? Is your data really Tn-seq ? Because Tn-seq data does not measure transcript abundance. It measures gene essentially, which is inversely proportional to read coverage (reflecting transposon insertion).
If it is indeed Tn-seq data, then it is questionable to use DEseq2 that is a tool optimized to study RNA-seq data. Are you following a Tn-seq-specific pipeline ? edit: after reading your post again, I understand this part better. Actually you might want to use DESeq2 for library normalization, then proceed with your in-house pipeline
From your post, it is not clear what your design is. I understand that you have knock-outs (factor A), and possibly an unknown factor B (since you wrote about 2*2 factorial design). Could you tell us a little more about those two factors ? It is also not clear if this a time course or if you have only data for time zero.

ADD REPLY • link 2.9 years ago by Carlo Yague 8.6k

0

Entering edit mode

Thank you Carlo.

Yes, the data is Tn-Seq, and so it is abundance of transposon insertion, or equivalently the relative frequency of that mutant in the population. It is true that historically this is been treated in just a binary fashion (using a threshold to call things essential / non-essential), but recently groups have been using changes in the insertion frequency across treatments / times to quantitatively assess fitness of the knockout mutants (see e.g. Morin et al https://elifesciences.org/articles/37072). The general idea is that if a given insertion mutant has higher-than-expected representation, that KO has higher fitness than the WT. And therefore the reverse is also true. The typical idea is to do a log2-fold change between a T0 (before the experiment) and T1 (after the treatment) for each gene (averaged in some way across insertions), and use this change as a quantitative estimate of fitness.

In our case, the experiment is growing the Tn-Seq library in the presence / absence of two competitor species. So there are four treatments in a factorial design: the library grown alone, the library in the presence of competitor A, the library in the presence of competitor B, and the library in the presence of competitors A and B. There is also the common T0.

In my mind, applying a tool like DESeq2 to just the final data (without the T0 data) tells us something about how relative insertion # (proportional to relative fitness) changes in a factorial way due to the presence of these two species. So perhaps a KO has neutral fitness when competitor A is present, reduced fitness when competitor B is present, and perhaps there is an interesting interaction where fitness is again neutral when both competitors are present. This would show up in DESeq2 as no significant main effect of species A, (probably) a significant effect of species B, and definitely a significant interaction.

Hopefully this helps, thank you for the clarifying questions.

ADD REPLY • link 2.9 years ago by chaco001 ▴ 40

0

Entering edit mode

Thanks for your detailed clarification. So factor A is the mutation (knock-outs), factor B is the presence/absence of competitor, the time (T0/T1) is currently not taken into account in the model.

To get back to your original question:

If I could incorporate [the T0 data], it would also give me information on absolute fitness.

Absolutely essential genes will be devoid of reads, regardless of the condition so even if you add T0 data in a linear model, you will just be able to identify genes that affect fitness differently in T0 vs (T1NoCompetition), so the time effect. I'm not saying that T0 can not be used to identify absolutely essential genes, just that differential analysis (e.g., DESeq2) is probably not the right tool for that: differential is always relative to a control, it can never be absolute.

ADD REPLY • link 2.9 years ago by Carlo Yague 8.6k