Hi all,
I am working with a dual RNA-seq dataset that contains reads from both mouse (Mus musculus, a eukaryote) and a bacterial species (a prokaryote). I aligned the reads to a combined reference genome (mouse + bacteria) using HISAT2.
Now I am at the quantification step and I am unsure how best to proceed:
For mouse, quantification is typically performed at the exon or transcript level using featureCounts -t exon.
For bacteria, quantification is usually done at the CDS level with featureCounts -t CDS, since bacterial GTF files contain few exon features.
My questions are:
What is the recommended approach for quantification in dual RNA-seq when working with both host (mouse) and bacteria using featureCounts?
Should I use separate annotation strategies (exons for mouse, CDS for bacteria) and then combine the count files later? I need raw counts to perform DGE analysis.
Are there any tools/pipelines particularly suited for handling mixed-species quantification in this way?
Any advice or examples of code/scripts for this type of analysis would be very helpful!
Thanks! K
Can you provide alignment stats per genome?
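One way to get those numbers from a combined-reference BAM is to sum the per-contig counts reported by samtools idxstats (tab-separated: contig name, length, mapped, unmapped). The snippet below is only a sketch: the contig name NC_000000.1 and the read counts are invented for illustration, and it assumes every contig not listed as bacterial belongs to mouse.

```python
from collections import defaultdict


def mapped_reads_per_genome(idxstats_text, bacterial_contigs):
    """Sum mapped reads per genome from `samtools idxstats` output."""
    totals = defaultdict(int)
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The "*" row holds reads with no reference assigned; skip it.
            continue
        # Assumption: anything not listed as bacterial is a mouse contig.
        genome = "bacteria" if name in bacterial_contigs else "mouse"
        totals[genome] += int(mapped)
    return dict(totals)


# Invented example input, in the layout that `samtools idxstats` prints.
example = (
    "chr1\t195471971\t1200\t3\n"
    "chr2\t182113224\t800\t1\n"
    "NC_000000.1\t4600000\t150\t0\n"
    "*\t0\t0\t500\n"
)
print(mapped_reads_per_genome(example, {"NC_000000.1"}))
```

In practice the text would come from running samtools idxstats on a coordinate-sorted, indexed BAM, and the set of bacterial contig names from the bacterial FASTA headers.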
Aligning to a combined reference with a splice-aware aligner may not be the best strategy, since all reads would be treated as potentially coming from spliced transcripts, which is not the case for bacterial transcripts.
It looks like https://pmc.ncbi.nlm.nih.gov/articles/PMC5613115/ aligns to the eukaryotic genome first and then uses the unmapped reads for the bacterial genome alignment.
I wonder if using a binning tool such as bbsplit (from the BBMap suite) followed by alignment to the individual genomes would be a better option. This would give you more control over reads that map across genomes.
Hi GenoMax,
https://pmc.ncbi.nlm.nih.gov/articles/PMC7249662/ : Here is the paper I was trying to follow.
A total of 36.6 million paired-end reads were sequenced, of which approximately 30.2 million (82.5%) mapped to the combined reference. The majority of reads aligned to the mouse genome, with per-chromosome mean coverage ranging from ~6× to 28×, while the bacterial genome had a mean coverage of 7.57×. Around 6.4 million reads (17.5%) remained unmapped. Among mapped reads, 32.5% were exonic, 52.7% intronic, and 14.8% intergenic.
Thanks for sharing this paper. To be honest, I still find it unclear how quantification was actually performed in these studies.
"We use HTSeq (Anders et al., 2015) or featureCounts (Liao et al., 2014) for read counting, where aligned sequences are quantified as an exon, transcript, gene etc., which results in a separate count matrix for host and pathogen reads, each consisting of genes (rows) and samples (columns)."
Thanks again! K
In the paper linked in my comment above, they did the alignments sequentially, so there were two separate alignment files. As long as you have GTF files for both genomes, it is a matter of running featureCounts on the two alignment files to get counts.
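As a rough sketch of what that could look like, the Python below builds one featureCounts command per genome and parses the resulting tables into separate host and bacterial count matrices. All file names are hypothetical. It assumes the bacterial GTF carries a gene_id attribute on its CDS rows (some annotations use a different attribute, so check yours), and --countReadPairs is only accepted by recent Subread releases, so drop it on older versions.

```python
import csv


def featurecounts_cmd(bam, gtf, out, feature_type, attribute="gene_id", paired=True):
    """Build a featureCounts command line as a list of arguments."""
    cmd = ["featureCounts", "-a", gtf, "-o", out,
           "-t", feature_type,   # feature type to count over (exon, CDS, ...)
           "-g", attribute]      # GTF attribute used to group features
    if paired:
        # Count fragments rather than single reads for paired-end data.
        cmd += ["-p", "--countReadPairs"]
    cmd.append(bam)
    return cmd


def read_counts(table_text):
    """Parse a featureCounts table into (sample names, {gene: [counts]})."""
    rows = [l for l in table_text.splitlines() if l and not l.startswith("#")]
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[6:]  # first six columns are Geneid, Chr, Start, End, Strand, Length
    counts = {r[0]: [int(x) for x in r[6:]] for r in reader}
    return samples, counts


# Hypothetical file names: one BAM and one GTF per genome.
host_cmd = featurecounts_cmd("host.bam", "mouse.gtf", "host_counts.txt", "exon")
bact_cmd = featurecounts_cmd("bacteria.bam", "bacteria.gtf", "bact_counts.txt", "CDS")
print(" ".join(host_cmd))
print(" ".join(bact_cmd))

# Invented table in the featureCounts output layout, to show the parser.
demo = (
    "# comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tbacteria.bam\n"
    "geneA\tNC_000000.1\t100\t500\t+\t401\t42\n"
    "geneB\tNC_000000.1\t900\t1500\t-\t601\t7\n"
)
print(read_counts(demo))
```

The commands can be run with subprocess.run(cmd, check=True) once featureCounts is on the PATH. Keeping host and bacterial matrices separate matches the passage quoted above and lets each be normalized and tested for DGE on its own.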