I'm having some trouble with repetitively mapping Illumina data. I'd like to remove any reads that have a reasonably close second-best hit in the genome/transcriptome, so if the best hit has an edit distance of n, I'd like to remove it if the second-best hit is within n+x (say n < 5, x < 2).
For historical/collaboration reasons, we've been using bwa, which I know is designed to find second-best hits but not necessarily all hits up to a given edit distance. I believe that the
bwa aln -N option is supposed to find all (more or less) hits up to a given edit distance (without the
-N option, it appears to only find second-best hits with up to a given number of mismatches, not including indels). I'm getting garbage for the hits in the XA tag, eg:
readID1 16 chr5 67488388 20 101M * 0 0 ACAGAGGAACCTTCTTAGTGCTCACCGACCTTATACCCAGCTATGGGGACCAGCTCCAGAAAGCACCTTTGCTCCCAATTGGTTAGCTATAGTAAGGTGCG DDDDDDDC?DCDDDDDDDDDDBDDDDFHHHHHHG@JJJJJJIIHJJJJIJJJJJIJJJJJJJIHGJJJJJIIGJJJJJJJJJIJJJJIHHHHHFFFFF@<C XT:A:U NM:i:1 X0:i:1 X1:i:2 XM:i:1 XO:i:0 XG:i:0 MD:Z:43G57 XA:Z:chr5,-67488388,101M,6;chr5,-67488390,2S99M,6;
Where you'll note that the start coordinates are identical or very similar for all three hits. For reference, this is how I'm calling bwa:
bwa aln -b -O2 -n 6 -t 12 -N ref.fa input.bam | bwa samse ref.fa - input.bam | samtools view -h -b -S -o remapped_N.bam -
I've also tried bowtie2 (very slow) and masai (very memory intensive) for mapping. A few questions:
Am I doing something wrong in how I'm calling bwa?
Does anyone have a better mapper for my particular situation? (I'm actually mapping transcriptome data against a genome including an artificial transcriptome as the rna-seq specific mappers don't seem to do well with finding possible second-best hits, but I'm open to suggestions.)
PS I've looked at this post: For short reads, which aligners find all hits given certain edit distance threshold? but I don't think any of the output was checked for accuracy nor speed.