I'm working with PDX samples, so I expect some mouse contamination in my primarily human DNA sequencing data. To separate reads of mouse and human origin, I generated a new reference genome by concatenating the human and mouse reference genomes. A co-worker's experience led me to ask whether changing the order would change the number of reads that mapped to the human or mouse chromosomes, and it did. The first time I mapped the reads, I put the human chromosomes first in the concatenated genome. When I put the mouse chromosomes first, the number of reads mapped to human chromosomes increased by 120. I mapped about 150 million reads in total using bwa-mem 0.7.6a, and I only counted reads with a mapping quality score above 20. When I map the same data twice to the same genome, nothing changes, so this isn't a general stochastic effect.
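Something like this gives the per-chromosome counts at MAPQ > 20 (a sketch, not my exact command; the BAM filename is illustrative, and -F 0x900 excludes secondary/supplementary alignments):

# Count MAPQ>20 primary alignments per reference sequence,
# then sum the human vs. mouse chromosome totals
samtools view -q 20 -F 0x900 mapped.bam | cut -f3 | sort | uniq -c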
Unfortunately, I don't have time at the moment to investigate the characteristics of the inconsistently mapped reads. Has anyone else observed changes in mapping when chromosome order changes? Has anyone else looked into the inconsistently mapped reads? Although I'm working with a concatenated genome from two species, I would expect this effect to apply equally to the different chromosome orderings found in the various flavors of the hg19 reference genome.
The command line was
bwa mem -t 4 -R [read group stuff] -M ${genome_index} f1.fastq f2.fastq
I de-duplicated, compressed, and sorted the alignments after mapping.
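Roughly this kind of post-processing (filenames and the choice of Picard MarkDuplicates for de-duplication are illustrative, not my exact commands):

# Compress to BAM and coordinate-sort (modern samtools syntax shown)
samtools view -b aligned.sam | samtools sort -o sorted.bam -
samtools index sorted.bam

# Mark duplicates; Picard MarkDuplicates is one common choice
java -jar picard.jar MarkDuplicates I=sorted.bam O=deduped.bam M=dup_metrics.txt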
Thanks in advance for any input.
[edited to address comments]
This may depend on the way you handle ambiguously mapped reads. What was the exact command line?
Also, it might help to put bwa-mem in the title and tag the post with bwa.
Thanks for the suggestions.
I might not understand the exact technicalities of PDX samples, but I don't understand the chromosome concatenation. I worked with bacterial DNA-seq, where the bacteria lived inside a mouse host and there was some mouse DNA contamination. We just mapped against our bacterial genome first; anything that didn't map, we then mapped against the host. That way you get two BAM files rather than parsing a particular chromosome of interest out of one BAM file (I'm guessing that is the way you are doing this).
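A rough sketch of that two-pass workflow, with index and file names assumed:

# Pass 1: map to the bacterial genome
bwa mem bacterial_index reads_1.fastq reads_2.fastq > pass1.sam

# Keep pairs where neither mate mapped (flag 12 = read unmapped + mate unmapped)
samtools view -b -f 12 pass1.sam > unmapped.bam

# Name-sort, convert back to FASTQ, and map the leftovers to the host genome
samtools sort -n -o unmapped_ns.bam unmapped.bam
samtools fastq -1 un_1.fastq -2 un_2.fastq unmapped_ns.bam
bwa mem host_index un_1.fastq un_2.fastq > pass2.sam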
I don't think chromosome order should make any difference. To my (limited) knowledge, bwa mem works by assigning scores to the best matches, so regardless of chromosome position you should get the best match, i.e. the best read alignment. I also believe all chromosomes are fragmented anyway at the indexing step.
We get better results by mapping the PDX data to a concatenated mouse-human genome than by mapping mouse-depleted reads to the human genome. With the concatenated approach, ~4% of PDX data maps to the mouse chromosomes, and essentially 0% of normal human data does. With the subtractive approach, mapping directly to the mouse genome first, 40% of PDX reads and 40% of normal human reads map to mouse. I assume the optimal strategy for reducing mouse contamination differs between human and bacterial samples because of the greater sequence similarity between human and mouse.
Like you, I would expect bwa-mem to report the best match regardless of chromosome position. That's not what I'm observing, even when considering only reads with mapping quality scores > 20. I hadn't thought about your point about chromosomes being fragmented during the indexing step. To me, that suggests the differences I'm seeing have to do with the order of the chromosomes' content, not the chromosomes per se.
To be explicit about the concatenated genome, the steps are just the following.
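Roughly this, with illustrative filenames (the mouse chromosome renaming is an assumption on my part, to keep the sequence names distinct after concatenation):

# Prefix mouse chromosome names so human chr1 and mouse chr1 stay distinguishable
sed 's/^>chr/>mchr/' mm10.fa > mm10_renamed.fa

# Concatenate the references; swap the two FASTA files to change chromosome order
cat hg19.fa mm10_renamed.fa > human_first.fa

# Build the bwa index on the combined genome
bwa index human_first.fa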
I suggest you give BBSplit a try; it's specifically designed for this problem, and it has various settings for how to handle reads that map ambiguously between the two genomes, as distinct from reads that map ambiguously within a single genome. So it should give better results for duplicated regions within a genome.
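An untested invocation sketch (filenames are illustrative; see the bbsplit.sh help for the full list of ambiguity settings):

# Bin read pairs by best-matching reference; ambiguous2 controls reads that
# map equally well to both genomes (options include best/toss/all/split)
bbsplit.sh in1=f1.fastq in2=f2.fastq ref=hg19.fa,mm10.fa basename=out_%.fastq ambiguous2=toss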