(This is my first time posting a question on Biostar. Please let me know if you need any more information. Thanks!)
I am working on insect/bacteria symbiosis. I have some small RNA-Seq data, and I want to see whether any miRNAs are specifically highly expressed in the host cells that are densely populated with bacterial endosymbionts.
The data come from three different tissues:
Bacteriocytes (cells packed with endosymbionts, ~90%)
Gut (almost no endosymbionts)
Whole insect (with endosymbionts, but somewhere in between, ~30%)
I have replicates from three different host genotypes. The data were collected in two batches of NGS sequencing experiments, which introduced a batch effect. I need to remove this batch effect before I perform the differential expression analysis.
My problem is:
In the principal component analysis using edgeR, the samples cluster by tissue when I normalize the miRNA read counts to the total number of mapped reads (reads mapped to the host genome + reads mapped to the symbiont genome). However, using this normalization for the differential expression analysis does NOT make sense to me: the reads mapped to the symbiont inflate the library size and influence the results, so that almost all miRNAs appear downregulated in the bacteriocytes (cells with ~90% symbiont reads).

On the other hand, when I normalize the miRNA reads to the reads mapped to the host genome only, I have a hard time getting the samples to cluster by tissue in the principal component analysis.

So my first question is: which library size (reads mapped to both genomes, or reads mapped to the host genome only) should I use to normalize my data? To me, normalizing to the host-mapped reads only makes more sense for DE analysis. That leads to my second question: is it appropriate to go ahead with the differential expression analysis without having successfully removed the batch effect?
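
To illustrate why the choice of library size worries me, here is a minimal Python sketch with made-up numbers (these are NOT my real counts, and the `cpm` helper and sample names are just hypothetical). It computes counts per million for one miRNA that has the same raw count in two tissues, once against host-mapped reads only and once against host + symbiont reads.

```python
def cpm(count, library_size):
    """Counts per million for one miRNA, given the chosen library size."""
    return count * 1_000_000 / library_size

# Hypothetical read counts, for illustration only (not from my data).
samples = {
    "bacteriocyte": {"mirna": 500, "host": 1_000_000, "symbiont": 9_000_000},
    "gut":          {"mirna": 500, "host": 1_000_000, "symbiont": 10_000},
}

results = {}
for name, s in samples.items():
    results[name] = {
        # Denominator: reads mapped to the host genome only
        "host_only": cpm(s["mirna"], s["host"]),
        # Denominator: reads mapped to host + symbiont genomes
        "host_plus_symbiont": cpm(s["mirna"], s["host"] + s["symbiont"]),
    }

for name, r in results.items():
    print(name, r)
```

With the host-only denominator the miRNA has the same CPM in both tissues, but with the combined denominator it looks about ten-fold lower in the bacteriocyte sample, purely because of the symbiont reads. This is the effect I think I am seeing across almost all miRNAs.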