Question

RNAseq analysis set up - viral infection of bacteria

0

Entering edit mode

7 months ago

Sean ▴ 10

Hi there,

I've got somewhat of a tricky situation in deciding how to go about setting up my differential expression analysis in DESeq2 for the experiment that was done and the questions being asked. In this experimental set up, a bacterial strain is either infected or not infected with a species specific phage.

I have three different genotypes, two conditions (infected/uninfected), and four time points. My main concern is whether I should/if it is appropriate to split up count matrices for the bacteria and the phage.

We want to look at which genes are differentially expressed between different genotypes when the bacteria has been infected by the phage, but of course if there is no phage then the number of reads mapping the each phage gene in the uninfected samples is zero across the board. Based on the DESeq2 vignette, I believe I can pre-filter based on a certain minimum value of reads, which would remove the phage genes anyway, and I think this solves my problem of zeros in this case. It seems to be that simply splitting the count matrix into bacterial/phage counts solves this as well, but I'm not sure how this would affect normalization/estimation of statistical parameters.

We also want to look at how the expression of viral genes changes between genotypes over time. In this case, it's only the viral counts that are of interest, and the "uninfected" condition is irrelevant. So this seems like another situation where simply splitting the counts into two separate phage and bacteria matrices also leads to a simplified scenario.

Is there anything inherently wrong with this approach? Or will I be totally messing up the estimation of parameters for the model to be accurate? Any input would be much appreciated as this is my first shot at RNAseq data analysis and it's a bit more complicated (as far I can tell) than I might have hoped for in a first attempt.

Thanks,
Sean

RNA-seq DESeq2 phage • 554 views

ADD COMMENT • link updated 7 months ago by GenoMax 142k • written 7 months ago by Sean ▴ 10

1

Entering edit mode

Can you tell us some stats which may help people trying to answer your question. Total numbers of reads for each sample (were these comparable), Overall % of reads that aligned (bacteria+virus) for each sample, out of the aligned reads what fraction of reads were aligned to bacteria + virus for each sample? Were these numbers consistent across samples? Do you have biological replicates for these samples for all categories?

ADD REPLY • link 7 months ago by GenoMax 142k

0

Entering edit mode

Ah yes, of course. I have biological triplicate for all samples, two time points for each uninfected sample (t=0, and t=30), four time points for each infected sample (t=0, t=10, t=20, t=30), the and the three genotypes, giving a total of 54 samples.

The sequencing was done as paired-end 150 bp, alignments were carried out using bowtie2 in end to end mode (since splicing not a concern) after adapter removal with cutdadapt.

For the tables below c/t denotes uninfected or infected, the number after for timepoints, r1/r2/r3 for each biological replicate. It seems that the total reads aligning to samples is fairly consistent, some slight outliers here and there but no obvious pattern between genotypes, treatments, or time points.

I'm sorry if the formatting is horrendous, I just did markdown tables but not sure the best way to go about it otherwise on here.

TOTAL READS PER SAMPLE

| genotype1 | genotype2 | genotype3 |
| --------- | --------- | --------- |
| c0_r1     | c0_r1     | c0_r1     |
| 9569203   | 9679859   | 10488347  |
| c0_r2     | c0_r2     | c0_r2     |
| 8574603   | 10194832  | 7720184   |
| c0_r3     | c0_r3     | c0_r3     |
| 9718370   | 10907068  | 10218078  |
| c30_r1    | c30_r1    | c30_r1    |
| 10036125  | 9740486   | 11061460  |
| c30_r2    | c30_r2    | c30_r2    |
| 8668646   | 9777186   | 7866943   |
| c30_r3    | c30_r3    | c30_r3    |
| 10373801  | 9798082   | 9179771   |
| t0_r1     | t0_r1     | t0_r1     |
| 10114616  | 9507378   | 10998533  |
| t0_r2     | t0_r2     | t0_r2     |
| 7217179   | 10920584  | 7745430   |
| t0_r3     | t0_r3     | t0_r3     |
| 9362792   | 10147511  | 9923545   |
| t10_r1    | t10_r1    | t10_r1    |
| 8692646   | 12611432  | 8744988   |
| t10_r2    | t10_r2    | t10_r2    |
| 10324839  | 11185887  | 6340977   |
| t10_r3    | t10_r3    | t10_r3    |
| 10064154  | 10128893  | 10359800  |
| t20_r1    | t20_r1    | t20_r1    |
| 10691155  | 10070761  | 9925975   |
| t20_r2    | t20_r2    | t20_r2    |
| 10321115  | 9217399   | 8746588   |
| t20_r3    | t20_r3    | t20_r3    |
| 9688530   | 9611357   | 9282792   |
| t30_r1    | t30_r1    | t30_r1    |
| 8966726   | 11037637  | 9837984   |
| t30_r2    | t30_r2    | t30_r2    |
| 7283270   | 10229603  | 7576092   |
| t30_r3    | t30_r3    | t30_r3    |
| 8571098   | 10128229  | 9364497   |

The averages for genotype_1, genotype_2, and genotype_3 are 9346603, 10271899, 9187888, respectively. So a bit higher for genotype_2 but nothing that seems crazy?

For alignment percentage, all samples had >95% of read pairs aligning concordantly exactly 1 time, <5% aligning concordantly more than 1 time.

PERCENT READS ALIGNING TO BACTERA/VIRUS/PLASMID

| bacterial_frac   | virus_frac       | plasmid_frac     |
| ---------------- | ---------------- | ---------------- |
| genotype1_c0_r1  | genotype1_c0_r1  | genotype1_c0_r1  |
| 99.4             | 0                | 0.6              |
| genotype1_c0_r2  | genotype1_c0_r2  | genotype1_c0_r2  |
| 99.3             | 0                | 0.7              |
| genotype1_c0_r3  | genotype1_c0_r3  | genotype1_c0_r3  |
| 99.3             | 0                | 0.7              |
| genotype1_c30_r1 | genotype1_c30_r1 | genotype1_c30_r1 |
| 99.4             | 0                | 0.6              |
| genotype1_c30_r2 | genotype1_c30_r2 | genotype1_c30_r2 |
| 99.3             | 0                | 0.7              |
| genotype1_c30_r3 | genotype1_c30_r3 | genotype1_c30_r3 |
| 99.4             | 0                | 0.6              |
| genotype1_t0_r1  | genotype1_t0_r1  | genotype1_t0_r1  |
| 98.2             | 1.1              | 0.7              |
| genotype1_t0_r2  | genotype1_t0_r2  | genotype1_t0_r2  |
| 98.5             | 0.9              | 0.6              |
| genotype1_t0_r3  | genotype1_t0_r3  | genotype1_t0_r3  |
| 98.8             | 0.6              | 0.6              |
| genotype1_t10_r1 | genotype1_t10_r1 | genotype1_t10_r1 |
| 95.9             | 3.5              | 0.6              |
| genotype1_t10_r2 | genotype1_t10_r2 | genotype1_t10_r2 |
| 95.6             | 3.7              | 0.6              |
| genotype1_t10_r3 | genotype1_t10_r3 | genotype1_t10_r3 |
| 96.8             | 2.7              | 0.5              |
| genotype1_t20_r1 | genotype1_t20_r1 | genotype1_t20_r1 |
| 91.9             | 7.5              | 0.5              |
| genotype1_t20_r2 | genotype1_t20_r2 | genotype1_t20_r2 |
| 89.9             | 9.6              | 0.5              |
| genotype1_t20_r3 | genotype1_t20_r3 | genotype1_t20_r3 |
| 92.1             | 7.4              | 0.5              |
| genotype1_t30_r1 | genotype1_t30_r1 | genotype1_t30_r1 |
| 88.4             | 11               | 0.5              |
| genotype1_t30_r2 | genotype1_t30_r2 | genotype1_t30_r2 |
| 85.5             | 14               | 0.5              |
| genotype1_t30_r3 | genotype1_t30_r3 | genotype1_t30_r3 |
| 87.5             | 12               | 0.5              |
| genotype2_c0_r1  | genotype2_c0_r1  | genotype2_c0_r1  |
| 99.3             | 0                | 0.7              |
| genotype2_c0_r2  | genotype2_c0_r2  | genotype2_c0_r2  |
| 99.3             | 0                | 0.7              |
| genotype2_c0_r3  | genotype2_c0_r3  | genotype2_c0_r3  |
| 99.3             | 0                | 0.7              |
| genotype2_c30_r1 | genotype2_c30_r1 | genotype2_c30_r1 |
| 99.4             | 0                | 0.6              |
| genotype2_c30_r2 | genotype2_c30_r2 | genotype2_c30_r2 |
| 99.3             | 0                | 0.7              |
| genotype2_c30_r3 | genotype2_c30_r3 | genotype2_c30_r3 |
| 99.3             | 0                | 0.7              |
| genotype2_t0_r1  | genotype2_t0_r1  | genotype2_t0_r1  |
| 98.7             | 0.7              | 0.6              |
| genotype2_t0_r2  | genotype2_t0_r2  | genotype2_t0_r2  |
| 98.7             | 0.5              | 0.8              |
| genotype2_t0_r3  | genotype2_t0_r3  | genotype2_t0_r3  |
| 98.9             | 0.4              | 0.7              |
| genotype2_t10_r1 | genotype2_t10_r1 | genotype2_t10_r1 |
| 97.2             | 2.1              | 0.6              |
| genotype2_t10_r2 | genotype2_t10_r2 | genotype2_t10_r2 |
| 97.7             | 1.7              | 0.5              |
| genotype2_t10_r3 | genotype2_t10_r3 | genotype2_t10_r3 |
| 98               | 1.5              | 0.6              |
| genotype2_t20_r1 | genotype2_t20_r1 | genotype2_t20_r1 |
| 95.4             | 4                | 0.6              |
| genotype2_t20_r2 | genotype2_t20_r2 | genotype2_t20_r2 |
| 93.5             | 6                | 0.5              |
| genotype2_t20_r3 | genotype2_t20_r3 | genotype2_t20_r3 |
| 94.8             | 4.7              | 0.6              |
| genotype2_t30_r1 | genotype2_t30_r1 | genotype2_t30_r1 |
| 94               | 5.5              | 0.5              |
| genotype2_t30_r2 | genotype2_t30_r2 | genotype2_t30_r2 |
| 90.7             | 8.7              | 0.6              |
| genotype2_t30_r3 | genotype2_t30_r3 | genotype2_t30_r3 |
| 91.5             | 7.9              | 0.5              |
| genotype3_c0_r1  | genotype3_c0_r1  | genotype3_c0_r1  |
| 98.7             | 0                | 1.3              |
| genotype3_c0_r2  | genotype3_c0_r2  | genotype3_c0_r2  |
| 99               | 0                | 1                |
| genotype3_c0_r3  | genotype3_c0_r3  | genotype3_c0_r3  |
| 99.1             | 0                | 0.9              |
| genotype3_c30_r1 | genotype3_c30_r1 | genotype3_c30_r1 |
| 98.7             | 0                | 1.3              |
| genotype3_c30_r2 | genotype3_c30_r2 | genotype3_c30_r2 |
| 98.8             | 0                | 1.2              |
| genotype3_c30_r3 | genotype3_c30_r3 | genotype3_c30_r3 |
| 99.2             | 0                | 0.8              |
| genotype3_t0_r1  | genotype3_t0_r1  | genotype3_t0_r1  |
| 96.4             | 2.4              | 1.2              |
| genotype3_t0_r2  | genotype3_t0_r2  | genotype3_t0_r2  |
| 97.7             | 1.4              | 0.9              |
| genotype3_t0_r3  | genotype3_t0_r3  | genotype3_t0_r3  |
| 98.3             | 0.8              | 0.9              |
| genotype3_t10_r1 | genotype3_t10_r1 | genotype3_t10_r1 |
| 92.1             | 6.9              | 1                |
| genotype3_t10_r2 | genotype3_t10_r2 | genotype3_t10_r2 |
| 92               | 7.3              | 0.7              |
| genotype3_t10_r3 | genotype3_t10_r3 | genotype3_t10_r3 |
| 94.3             | 5                | 0.7              |
| genotype3_t20_r1 | genotype3_t20_r1 | genotype3_t20_r1 |
| 88.7             | 10.4             | 0.9              |
| genotype3_t20_r2 | genotype3_t20_r2 | genotype3_t20_r2 |
| 86.7             | 12.8             | 0.4              |
| genotype3_t20_r3 | genotype3_t20_r3 | genotype3_t20_r3 |
| 89.3             | 10               | 0.7              |
| genotype3_t30_r1 | genotype3_t30_r1 | genotype3_t30_r1 |
| 87.8             | 11.2             | 1                |
| genotype3_t30_r2 | genotype3_t30_r2 | genotype3_t30_r2 |
| 86.4             | 12.9             | 0.7              |
| genotype3_t30_r3 | genotype3_t30_r3 | genotype3_t30_r3 |
| 90               | 9.2              | 0.8              |

I think things looks fairly consistent across samples. We see fewer reads aligning to the virus in genotype_2, but that is expected based on what is known about the genotype and this particular infection model. The plasmid in the bacteria also has a fairly consistent number of reads and we shouldn't expect it to vary much across samples. The percent of reads mapping is a bit higher in genotype_3, but that's almost certainly because there is a gene on the plasmid that is not present in the other two genotypes, so I would say that is expected as well.

ADD REPLY • link 7 months ago by Sean ▴ 10

1

Entering edit mode

I think things looks fairly consistent across samples.

Agreed. With a two genome experiment this can be a little tricky but you seem to have consistent data. Hopefully you will have some suggestions soon.

ADD REPLY • link 7 months ago by GenoMax 142k