Background: I've inherited a new RNAseq data set and am thinking about updating my approach (last time I did this I was using HISAT and Cuffdiff). I'd like some opinions on the best strategies to disentangle/filter out parasite and other microbial reads from infected-host reads before performing a differential gene expression analysis on the host. Example sample types (all RNA reads): worm (C. elegans), worm + bacterial parasite, worm + mixed microbiome (+/- parasite).
I did initial QC on a couple of samples by aligning trimmed/quality-filtered RNAseq reads with STAR against a combined, all-organisms concatenated genome FASTA (with a matching concatenated GTF annotation for the transcriptome). I think that even for samples that don't contain every organism, using the same combined genome FASTA for mapping is probably the right move (to keep any multi-mapping biases and such the same across samples). This worked fine to give me a sense of read abundance and multi-mapping frequency. Microbe reads were present, but too few for robust analysis, so my goal is just to get them out of the way (maybe recording their abundance to say something general about the presence of the bugs).
Questions: I gather that using a pseudo-mapper such as Salmon can be faster and more accurate at the transcript level, and that it pipes nicely into DESeq2 for analysis. It is not clear to me, though, whether the pseudo-mapping approach lets me filter out all reads that map to any microbial contigs/transcripts before moving on to the DESeq2 analysis. If I understand Salmon's outputs correctly, I just get a quantification for each possible transcript and can no longer identify which reads it came from.
I could probably remove all microbe transcript rows from the output file based on their IDs/names, but I have some lingering reservations about multi-mapping and such. This could just be a suspicion based on ignorance, as I'm new to this type of mapping approach.
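To make the row-removal idea concrete, here is a minimal sketch of what I have in mind. It assumes the standard quant.sf columns (Name, Length, EffectiveLength, TPM, NumReads) and that microbial transcript IDs can be recognized by a prefix; the `microbe_` prefix, the `split_quant` helper, and the toy rows are all made up for illustration.

```python
import csv
import io

# Hypothetical prefix marking microbial transcript IDs in the combined index.
MICROBE_PREFIX = "microbe_"

def split_quant(quant_text, microbe_prefix=MICROBE_PREFIX):
    """Split quant.sf rows into host rows and a total microbial read count."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    host_rows = []
    microbe_reads = 0.0
    for row in reader:
        if row["Name"].startswith(microbe_prefix):
            # Keep only a running total, to report microbial abundance.
            microbe_reads += float(row["NumReads"])
        else:
            host_rows.append(row)
    return host_rows, microbe_reads

# Toy quant.sf content (invented values, not real data).
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "worm_tx1\t1200\t1050.0\t400000.0\t90.0\n"
    "worm_tx2\t800\t650.0\t350000.0\t60.5\n"
    "microbe_gene0001\t900\t750.0\t250000.0\t12.5\n"
)
host_rows, microbe_reads = split_quant(example)
print([row["Name"] for row in host_rows], microbe_reads)
```

As far as I can tell, the TPM column would no longer sum to one million after dropping rows, so it would need rescaling if I used it; the NumReads column is what I would expect a count-based analysis to rely on.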
Would it be necessary to use STAR first, remove all reads other than those uniquely aligning to the worm, and then use either the STAR output BAM or the filtered read set as input to Salmon?
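For reference, this is roughly the filtering step I am picturing, sketched on SAM-format text lines. It assumes the aligner writes the NH:i tag (number of reported alignments for the read), which I believe STAR does by default, and that the worm contigs in my combined FASTA are named as below. The contig names, the `unique_worm_read_names` helper, and the toy records are assumptions for illustration only.

```python
# Assumed reference names for the worm chromosomes in the combined FASTA.
WORM_CONTIGS = {"I", "II", "III", "IV", "V", "X", "MtDNA"}

def nh_value(fields):
    """Return the NH:i tag value, or None if the record has no such tag."""
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:])
    return None

def unique_worm_read_names(sam_lines, worm_contigs=WORM_CONTIGS):
    """Names of reads whose every record is a unique hit on a worm contig."""
    keep, reject = set(), set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, rname = fields[0], int(fields[1]), fields[2]
        unmapped = bool(flag & 0x4)
        if unmapped or nh_value(fields) != 1 or rname not in worm_contigs:
            reject.add(name)
        else:
            keep.add(name)
    # A read (or pair) is kept only if none of its records was rejected.
    return keep - reject

# Toy SAM records (invented): 11 mandatory fields followed by the NH tag.
toy_sam = [
    "@HD\tVN:1.4",
    "r1\t0\tI\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tI\t500\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tbact_contig1\t40\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r3\t0\tbact_contig1\t900\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\tNH:i:0",
]
print(sorted(unique_worm_read_names(toy_sam)))
```

The resulting name list could then be used to pull the matching reads out of the original FASTQ files. Because mates share a read name, a pair is dropped here if either mate fails the test, which may be stricter than necessary.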
Would a Salmon decoy genome composed of concatenated microbial genomes work for this (with the "target" genome being just worm)?
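In case it helps to see what I mean, here is a minimal sketch of how I would prepare the inputs for such a decoy-aware index, based on my reading of the Salmon documentation: the target transcripts come first in a combined "gentrome" FASTA, the decoy sequences follow, and a decoys.txt file lists the decoy sequence names. The helper names and the toy sequences are hypothetical.

```python
def fasta_names(fasta_text):
    """Sequence names: the first whitespace-delimited token of each header."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]

def build_gentrome(transcripts_fasta, decoy_fasta):
    """Concatenate transcripts first, decoys last, and list the decoy names."""
    gentrome = transcripts_fasta.rstrip("\n") + "\n" + decoy_fasta.rstrip("\n") + "\n"
    decoys_txt = "\n".join(fasta_names(decoy_fasta)) + "\n"
    return gentrome, decoys_txt

# Toy inputs (invented): worm transcripts as targets, microbial genomes as decoys.
worm_transcripts = ">worm_tx1 hypothetical transcript\nACGTACGT\n"
microbial_genomes = (
    ">bact_contig1 hypothetical parasite chromosome\nGGGCCC\n"
    ">bact_plasmid1\nTTTAAA\n"
)
gentrome, decoys_txt = build_gentrome(worm_transcripts, microbial_genomes)
print(decoys_txt, end="")
```

The two outputs would be written to files and passed to something like `salmon index -t gentrome.fa -d decoys.txt -i <index_dir>`. My understanding is that reads mapping best to a decoy are discarded rather than quantified, so this route would not give me the microbial abundance I mentioned wanting to record.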
Even after producing alignments with STAR, using Salmon seems to have some advantages for DEG analysis, I think. It wouldn't be a time saver in the end, but might it still be more accurate?
Thank you for any advice!