I've aligned paired-end Illumina reads to the human genome (indexed with bwa index -a bwtsw, using the same version of BWA as for alignment). The library should contain lots of interchromosomal rearrangements, so I'm seeing the expected "[infer_isize] fail to infer insert size: weird pairing" message from bwa sampe. My alignments are ... odd. Some read pairs have no alignments, even though I can align them as single reads with no problem. And in pairs where both reads do align, the alignment of the second read is always nonsensical:
SOLEXA1:7:100:1002:190#0 97 chr9 135276597 0 41M chr5 5500811 0 CAGCTACTCAGGAGACTGAGGCTGGGGAATCGCTTGAACCC BB=<@@)BABBB=B9=BB=AB=AB@'9:=94>9>==<>,== XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:10 X1:i:518 XM:i:1 XO:i:0 XG:i:0 MD:Z:14G26
SOLEXA1:7:100:1002:190#0 145 chr5 5500811 0 61M chr9 135276597 0 ATTAAAACAATTAAAAAAATAAAATTACAAATGGAAAGGACAAACCAGACCTTACAACTGT B9:>BB>BB?>=BCBC@6@1?@?@26<BBA?BC@8<CCBBBCB;BCCB@BBA>BCCCBAB= XT:A:R NM:i:48 SM:i:0 AM:i:0 X0:i:10 X1:i:518 XM:i:1 XO:i:0 XG:i:0 MD:Z:0G0G0G0T0T0C1A0G0C0G0A0T0T0C0C0C0C0T0G0C0C0T0C0A0G0T1T0C0C2A0G0T2C0T0G0G0G2T1C1G0G1G0C1T0G1C0A0C0
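For anyone reading the two records, here is a minimal sketch for decoding the FLAG column (97 and 145 above). The bit values are the ones defined in the SAM specification; the function name and the short labels are my own and only for illustration.

```python
# Bit values follow the SAM specification; the labels are hypothetical shorthand.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for flag in (97, 145):
        print(flag, decode_flag(flag))
```

Neither flag has the proper-pair bit set, which is what I would expect for mates on different chromosomes.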
... That's pretty obviously a wrong MD tag. All pairs that align to a single chromosome have non-zero mapping qualities and are labeled XT:A:U (unique alignments). But all pairs that "align" (if you can call it that, with an MD string like that) to two different chromosomes have zero MQs and are labeled XT:A:R (repeat alignments). This is new behavior, but I haven't yet been able to find an older version of BWA that doesn't behave this way with these reads. The reads are in Illumina's fastq format (phred+64 quality chars), but this happens even if I convert them before alignment. So I'm at a bit of a loss as to why this is happening.
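One thing that can be tested directly is whether an MD string is at least internally consistent with the NM tag and the CIGAR length of its record. Below is a rough sketch that tallies matched bases, mismatches and deleted bases from an MD value. The function name is made up for this post, and it assumes the usual MD grammar (numbers, single mismatched reference bases, and ^-prefixed deletions).

```python
import re

# One token is a run of matches (digits), a deletion (^ + bases),
# or a single mismatched reference base.
_MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def md_summary(md):
    """Return (matched, mismatched, deleted) base counts for an MD value.

    Accepts either the bare value ("14G26") or the full tag ("MD:Z:14G26").
    """
    md = md.split(":")[-1]  # drop an optional "MD:Z:" prefix
    matched = mismatched = deleted = 0
    for run, deletion, base in _MD_TOKEN.findall(md):
        if run:
            matched += int(run)
        elif deletion:
            deleted += len(deletion)
        else:
            mismatched += 1
    return matched, mismatched, deleted


if __name__ == "__main__":
    print(md_summary("MD:Z:14G26"))
```

For a record with no indels, matched + mismatched should equal the CIGAR M length, and mismatched should equal NM; a disagreement would suggest the tag itself is garbled rather than the alignment simply being poor.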
Has anyone else ever seen behavior like this? I'm looking for any clues; anything I can test to figure this out. Thanks in advance ...
EDIT 1: I can BLAT the reads on UCSC and see that they map F/R, with an isize of ~150 bp (in one case), but the SAM records place them as more than just overlapping: the first read is at the correct position (and orientation), but the second read is placed right on top of the first, so that its left edge (on the reference strand) starts before the left edge of the first read. The isize is listed as -5 (1st read) and 5 (2nd read). Could this be a result of the failure to infer insert size? If so, is there any way to set it manually (as there used to be for maq)?
EDIT 2: Nope - bad alignments and ridiculous MD tags are the same after running 'bwa sampe -a 500 ...'