I am trying to minimize mapping bias using WASP (http://www.nature.com/nmeth/journal/v12/n11/full/nmeth.3582.html). The algorithm first intersects a BAM/SAM file with a VCF file containing the reference SNPs for the species. The original BAM/SAM file is then split into alignments that intersect a SNP and alignments that do not.
For each intersecting alignment, the SNPs within it are flipped to the opposite allele, producing new (synthetic) reads carrying the flipped alleles in all possible combinations. These synthetic reads are written to a new FASTQ file, which does not contain the original read.
This is an example of how synthetic reads are produced from an original alignment that intersects 3 SNP sites:
- 1 read with snp1, snp2 and snp3 flipped
- 3 reads with either snp1 & snp2, snp1 & snp3, or snp2 & snp3 flipped
- 3 reads with only snp1, snp2, or snp3 flipped
So, for an alignment intersecting 3 SNP sites, there are 2^3 − 1 = 7 synthetic reads stored in a new FASTQ file, without the original read they came from.
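To make the combinatorics concrete, here is a small sketch (my own illustration, not WASP's actual code) of how all 2^n − 1 flipped versions of a read could be enumerated. The function name and the 0-based-position representation of SNPs are assumptions for the example:

```python
from itertools import combinations

def synthetic_reads(read, snps):
    """Generate all 2^n - 1 allele-flipped versions of `read`.

    `snps` maps 0-based positions within the read to the alternate
    allele at that position. The original (unflipped) read itself
    is not included in the output.
    """
    positions = sorted(snps)
    out = []
    # Choose every non-empty subset of SNP positions to flip.
    for k in range(1, len(positions) + 1):
        for combo in combinations(positions, k):
            bases = list(read)
            for pos in combo:
                bases[pos] = snps[pos]
            out.append("".join(bases))
    return out

# A read overlapping 3 SNP sites yields 7 synthetic reads:
reads = synthetic_reads("ACGTACGT", {1: "G", 4: "T", 6: "C"})
print(len(reads))  # 7
```

For a read overlapping 3 SNP sites this produces exactly the 1 + 3 + 3 = 7 reads listed above.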
The paper says that once the flipping is done, each read is remapped, and if it does not map to exactly the same location, the read is discarded.
The way I understand this is that if a synthetic read does not map to the same location as the original one, then the alignment of the original read is considered biased.
If this is correct, then for a read intersecting a single SNP site it is easy to see how the algorithm works. However, as in the example above, when multiple synthetic reads are produced from one alignment, some of them may map to the exact same location as the original read while others do not, and that is where I fail to understand how the algorithm works.
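For what it's worth, the strict interpretation I describe above (discard the original if *any* synthetic read fails to remap to the same place) would amount to something like this sketch. This is only my reading, not WASP's actual code, and the function name and (chrom, pos) tuples are hypothetical:

```python
def keep_original(original_pos, remapped_positions):
    """Strict interpretation: keep the original alignment only if
    every synthetic read remapped to the same (chrom, pos) as the
    original. A synthetic read that mapped elsewhere, or failed to
    map at all (represented here as None), discards the original.
    """
    return all(p == original_pos for p in remapped_positions)

# All 7 synthetic reads came back to the same spot -> keep:
print(keep_original(("chr1", 1000), [("chr1", 1000)] * 7))  # True
# One synthetic read mapped elsewhere -> discard:
print(keep_original(("chr1", 1000),
                    [("chr1", 1000)] * 6 + [("chr2", 500)]))  # False
```

Whether WASP really applies this all-or-nothing rule, or something more lenient when only some combinations fail, is exactly what I am asking.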
Has anyone figured out how WASP processes alignments intersecting more than one SNP site?