Hi everyone!
I have a lot of (short) RNAseq reads from a metagenome sample. I am only interested in the reads originating from one species in the microbial community, so i would like to filter my fastq files accordingly.
I have a reference genome of the species, so my naive approach would be to map all reads to this reference (let's say with bowtie2) and keep all reads that have mapped for further analysis.
However, this approach seems to introduce a bias that is much larger than i expected. I mapped my data twice: First to a bunch of reference genomes and then only to the single reference that i am interested in. In the second case, the amount of reads mapping to my genome of interest is much higher. Can you explain to me why this happens? Am i now also collecting all the reads of all related species?
Can i overcome the problem by adjusting parameters and making the mapping more strict? Or is the whole approach of mapping to only a single reference just not reasonable?
Thank you! I would like to determine differential gene expression for this one species. So a high level of "contamination" with other reads is a huge problem. Can you recommend an aligner that might be a better fit for what im trying to do?
This issue is not fixable by using a different aligner. If you had UMI's on your reads that would be one way to address their origin for sure.
The samples are from microbial communities, so there is no opportunity to put different tags on the individual species' DNA. Is the right course of action assembly everything in the sample, then mapping the reads, then extracting those which mapped contigs resembling my reference?