I have several questions regarding a RNA-seq analysis project.
My experimental design is as follows. I have 3 genotypes. Each genotype was inoculated with pathogens. Tissue were collected at 0 day as control and 3 day as test. Each experiment has 3 biological replicates. And each sample was sequence on two lanes lane5 and lane6.
The naming system I used is (genotype)(biological replicate)(inoculation)_(lane). For example, genotype Col-0 collected at 0 day was sequenced on lane5 and lane6 to produce file C10_L5 and C10_L6. Genotype Col-0 collected at 3 day was sequenced on lane5 and lane6 to produce file C13_L5 and C13_L6. This is the first biological replicate. So the other two are C20_L5, C20_L6, C23_L5, C23_L6 and C30_L5, C30_L6, C33_L5, C33_L6.
The colData from the DESeq data set looks like this.
DataFrame with 36 rows and 8 columns folders center machine lane genotype inoculation bioReplicates <factor> <factor> <factor> <factor> <factor> <factor> <factor> C10_L5 C10_L5 BGI-WH Hiseq 4000 Lane 5 Col-0 0d 1 C10_L6 C10_L6 BGI-WH Hiseq 4000 Lane 6 Col-0 0d 1 C13_L5 C13_L5 BGI-WH Hiseq 4000 Lane 5 Col-0 3d 1 C13_L6 C13_L6 BGI-WH Hiseq 4000 Lane 6 Col-0 3d 1 C20_L5 C20_L5 BGI-WH Hiseq 4000 Lane 5 Col-0 0d 2 ... ... ... ... ... ... ... ... F23_L6 F23_L6 BGI-WH Hiseq 4000 Lane 6 2R3R.F1 3d 2 F30_L5 F30_L5 BGI-WH Hiseq 4000 Lane 5 2R3R.F1 0d 3 F30_L6 F30_L6 BGI-WH Hiseq 4000 Lane 6 2R3R.F1 0d 3 F33_L5 F33_L5 BGI-WH Hiseq 4000 Lane 5 2R3R.F1 3d 3 F33_L6 F33_L6 BGI-WH Hiseq 4000 Lane 6 2R3R.F1 3d 3
I used Salmon to quantify the transcripts. I used the default setting of tximport and the output was generated "using counts and average transcript lengths from tximport". I know TPM is recommended for abundance estimate and comparing across samples and batches. Which TPM import option should I use considering my dataset? I vaguely understand the explanations here https://support.bioconductor.org/p/84883/. But which one should I choose? In other words, what different situations are these different options suited for?
From the distant plot, the data from lane5 and lane6 of each sample always cluster together. I guess the batch effect is minimal? I read the DESeq2 work flow. https://www.bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects How can I visualized the difference batch effect correction made in my analysis (DE genes)? Is there a more elaborative instruction somewhere including visualization?
The DESeq2 design I used is
design = ~ genotype + inoculation + genotype:inoculation
Should I include the batch (two lanes) in the model? Or should I just use the sva package and add in the surrogate variables?
Finally, I'm really interested in the interaction term. I want to know the inoculation specific changes in different genotype. I would appreciate if you could share some good papers (RNA-seq studies) with similar design and use the interaction term to explore the underling biology.
Thank you very much!