DESeq2 design and batch effect
1
2
Entering edit mode
4.8 years ago
georgewwp ▴ 30

Hi all,

I have several questions regarding a RNA-seq analysis project.

My experimental design is as follows. I have 3 genotypes. Each genotype was inoculated with pathogens. Tissue were collected at 0 day as control and 3 day as test. Each experiment has 3 biological replicates. And each sample was sequence on two lanes lane5 and lane6.

The naming system I used is (genotype)(biological replicate)(inoculation)_(lane). For example, genotype Col-0 collected at 0 day was sequenced on lane5 and lane6 to produce file C10_L5 and C10_L6. Genotype Col-0 collected at 3 day was sequenced on lane5 and lane6 to produce file C13_L5 and C13_L6. This is the first biological replicate. So the other two are C20_L5, C20_L6, C23_L5, C23_L6 and C30_L5, C30_L6, C33_L5, C33_L6.

The colData from the DESeq data set looks like this.

DataFrame with 36 rows and 8 columns
folders   center    machine     lane genotype inoculation bioReplicates
<factor> <factor>   <factor> <factor> <factor>    <factor>      <factor>
C10_L5   C10_L5   BGI-WH Hiseq 4000   Lane 5    Col-0          0d             1
C10_L6   C10_L6   BGI-WH Hiseq 4000   Lane 6    Col-0          0d             1
C13_L5   C13_L5   BGI-WH Hiseq 4000   Lane 5    Col-0          3d             1
C13_L6   C13_L6   BGI-WH Hiseq 4000   Lane 6    Col-0          3d             1
C20_L5   C20_L5   BGI-WH Hiseq 4000   Lane 5    Col-0          0d             2
...         ...      ...        ...      ...      ...         ...           ...
F23_L6   F23_L6   BGI-WH Hiseq 4000   Lane 6  2R3R.F1          3d             2
F30_L5   F30_L5   BGI-WH Hiseq 4000   Lane 5  2R3R.F1          0d             3
F30_L6   F30_L6   BGI-WH Hiseq 4000   Lane 6  2R3R.F1          0d             3
F33_L5   F33_L5   BGI-WH Hiseq 4000   Lane 5  2R3R.F1          3d             3
F33_L6   F33_L6   BGI-WH Hiseq 4000   Lane 6  2R3R.F1          3d             3


I used Salmon to quantify the transcripts. I used the default setting of tximport and the output was generated "using counts and average transcript lengths from tximport". I know TPM is recommended for abundance estimate and comparing across samples and batches. Which TPM import option should I use considering my dataset? I vaguely understand the explanations here https://support.bioconductor.org/p/84883/. But which one should I choose? In other words, what different situations are these different options suited for?

From the distant plot, the data from lane5 and lane6 of each sample always cluster together. I guess the batch effect is minimal? I read the DESeq2 work flow. https://www.bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects How can I visualized the difference batch effect correction made in my analysis (DE genes)? Is there a more elaborative instruction somewhere including visualization?

The DESeq2 design I used is

design = ~ genotype + inoculation + genotype:inoculation


Should I include the batch (two lanes) in the model? Or should I just use the sva package and add in the surrogate variables?

Finally, I'm really interested in the interaction term. I want to know the inoculation specific changes in different genotype. I would appreciate if you could share some good papers (RNA-seq studies) with similar design and use the interaction term to explore the underling biology.

Thank you very much!

George

RNA-Seq DESeq2 Salmon batch effect • 3.2k views
1
Entering edit mode

1
Entering edit mode
4.8 years ago

Hi George,

From my perspective, the best option to use is lengthScaledTPM, if at all possible. For me, this represents the ideal normalisation strategy given the data that you've got. To understand which method is actually the 'best', you should probably aim to read the published literature on the topic of RNA-seq normalisation strategies. Methods that produce expression units like FPKM/RPKM and TPM have actually come under some criticism, but to be truthful it's possible to criticise every normalisation strategy to some degree.

Regarding the batch effect, it does appear to be minimal, based on what you've written. I would just include it as a covariate in the DESeq2 design model, and that should suffice. I would not use sva to directly correct for batch.

Hope that this helps Kevin