I am trying to minimize mapping bias using WASP (http://www.nature.com/nmeth/journal/v12/n11/full/nmeth.3582.html). The algorithm first intersects a BAM/SAM file with a VCF file containing the reference SNPs for the species. The original BAM/SAM file is then split into alignments that intersect a SNP and alignments that do not.
For each intersecting alignment, the SNPs within it are flipped to the opposite allele, producing new (synthetic) reads carrying the flipped alleles in all possible combinations. These synthetic reads are written to a new FASTQ file, which does not contain the original read.
This is an example of how synthetic reads are produced from an original alignment that intersects 3 SNP sites:
- 1 read with snp1, snp2 and snp3 flipped
- 3 reads with either snp1 & snp2, snp1 & snp3, or snp2 & snp3 flipped
- 3 reads with only snp1, snp2, or snp3 flipped
So, for an alignment intersecting 3 SNP sites, there are 2^3 − 1 = 7 synthetic reads stored in a new FASTQ file, without the original read they came from.
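To make the combinatorics concrete, here is a small sketch (my own illustration, not WASP's actual code) of how all 2^n − 1 flipped versions of a read could be enumerated. The function name and the 0-based-position representation of SNPs are assumptions for the example:

```python
from itertools import combinations

def synthetic_reads(read, snps):
    """Generate all 2^n - 1 allele-flipped versions of `read`.

    `snps` maps 0-based positions within the read to the alternate
    allele at that position. The original (unflipped) read itself
    is not included in the output.
    """
    positions = sorted(snps)
    out = []
    # Choose every non-empty subset of SNP positions to flip.
    for k in range(1, len(positions) + 1):
        for combo in combinations(positions, k):
            bases = list(read)
            for pos in combo:
                bases[pos] = snps[pos]
            out.append("".join(bases))
    return out

# A read overlapping 3 SNP sites yields 7 synthetic reads:
reads = synthetic_reads("ACGTACGT", {1: "G", 4: "T", 6: "C"})
print(len(reads))  # 7
```

For a read overlapping 3 SNP sites this produces exactly the 1 + 3 + 3 = 7 reads listed above.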
The paper says that once the flipping is done, each read is remapped, and if it does not map to exactly the same location, the read is discarded.
The way I understand this is that if a synthetic read does not map to the same location as the original one, then the alignment of the original read is considered biased.
If this is correct, then for a read intersecting a single SNP site it is easy to see how the algorithm works. However, as in the example above, when multiple synthetic reads are produced from one alignment, some of them may map to the exact same location as the original read while others do not, and that is where I fail to understand how the algorithm works.
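For what it's worth, the strict interpretation I describe above (discard the original if *any* synthetic read fails to remap to the same place) would amount to something like this sketch. This is only my reading, not WASP's actual code, and the function name and (chrom, pos) tuples are hypothetical:

```python
def keep_original(original_pos, remapped_positions):
    """Strict interpretation: keep the original alignment only if
    every synthetic read remapped to the same (chrom, pos) as the
    original. A synthetic read that mapped elsewhere, or failed to
    map at all (represented here as None), discards the original.
    """
    return all(p == original_pos for p in remapped_positions)

# All 7 synthetic reads came back to the same spot -> keep:
print(keep_original(("chr1", 1000), [("chr1", 1000)] * 7))  # True
# One synthetic read mapped elsewhere -> discard:
print(keep_original(("chr1", 1000),
                    [("chr1", 1000)] * 6 + [("chr2", 500)]))  # False
```

Whether WASP really applies this all-or-nothing rule, or something more lenient when only some combinations fail, is exactly what I am asking.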
Has anyone figured out how WASP processes alignments intersecting more than one SNP site?