Question

Filtering bam files generated by bwa mem

1

Entering edit mode

8.2 years ago

kulvait ▴ 270

Hi, in my alignments by bwa mem I have found some spots where there is very high number of mismatches. I think these alignments are not correct and might stem from low quality reads or whatever. Region with many mismatches

My bam/sam files are generated by bwa mem and thus I decided to filter them based on the alignment quality and other parameters. First what I noticed is that the MAPQ score from sam spec https://samtools.github.io/hts-specs/SAMv1.pdf is set to 60 for all reads. Thus I assume that bwa mem does not use this parameter and thus I can not filter by MAPQ.

The parameters that bwa mem adds to the resulting bam/sam files are MD:Z:45C12T2A3T31T10C109T48 NM:i:7 AS:i:232 XS:i:19. I found their meaning on this page http://bio-bwa.sourceforge.net/bwa.shtml

MD Mismatching positions/bases
NM Edit distance
AS Alignment score
XS Suboptimal alignment score

I did not find any further documentation for these parameters. Do you know precisely how to understand each of them? Edit distance of indel is probably higher than edit distance of mismatch since if I set Edit distance too low I do not see any INDELs.

The best solution for my problem would be to control "Number of mismatches" in given read. According to http://bio-bwa.sourceforge.net/bwa.shtml this is in XM parameter. But this parameter is not present in my bam/sam files. Do you know how to add this parameter to bwa mem output or how to control output based on number of mismatches?

Thanks, Vojtech.

P.S: Filtering based on custom sam/bam parameters is not straigthforward. I have found this solution http://seqanswers.com/forums/showthread.php?t=24123 and corrected it to work partly but using awk and assuming fixed field number is not the best way to do these things. So I wonder if there is some kind of program that does filtering based on custom fields. Suggested pysam looks very low level for me, I would like some command line utility.

next-gen sequencing SNP • 5.8k views

ADD COMMENT • link 8.2 years ago by kulvait ▴ 270

0

Entering edit mode

Filtering based on custom sam/bam parameters is not straigthforward.

it's easy, using my tool https://github.com/lindenb/jvarkit/wiki/SamJS which has recently moved to picard FilterSamReads : https://broadinstitute.github.io/picard/command-line-overview.html

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

did you try to use the NM tag (edit distance ) ? see also How To Find Out The Mismatchs Of An Alignment Entry In The Sam File?

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

If this is really prevalent, I would trying GATK's IndelRealigner to see if it can fix things up a bit. While BWA tries to map each read on a read-by-read basis, with no consideration given to how many or well other reads mapped to that location, IndelRealigner tries to remap regions where many reads mis-match, indicative of an indel. Your data looks like a classic indel to me, since that is way too many mismatches for a modern sequencer.

Filtering the BAM should really be a last resort. Before that, you should see if your downstream tools even acknowledge these reads, or if they pass those tools built-in filters. Best to keep all your data in one place...

ADD REPLY • link 8.2 years ago by John 13k

0

Entering edit mode

In the samtools spec I found MD tag description

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string '10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.

I think its something I can use in mismatch filtering since if I parse this tag correctly, I can count number of mismatches quite well. My problem is twofold. First I did some softcliping of bam alignments not to contain primer sequences of amplicons our dept is using but I did not change MD tag. Thus I need to do that. Fortunately I can do that using command samtools calmd -rb incorrectmd.bam ref.fa > correctmd.bam

-r is for adding another tag called BQ in which there is called Base Alignment Quality (BAQ) and described here http://samtools.sourceforge.net/mpileup.shtml

ADD REPLY • link 8.2 years ago by kulvait ▴ 270

score 0 · Answer 1 · 2016-03-08

It is very unlikely that the mismatches that you observe are caused by low quality measurements bunching up in some region. By the time the DNA makes into the instrument no one knows where it came from.

The reason you get many mismatches is that your DNA differs from the reference (very common to see that) and/or that there are low complexity or repetitive regions. Your alignments individually are still the optimal.

It is very common that the first solution that comes to mind is to get rid of these or to force the alignments to behave certain ways. As others have mentioned it, the goal should be the correct the results that you get from these, for example one can make correct SNP calls even from incorrect alignments.

score 0 · Answer 2 · 2016-03-09

Well, I would also like to find out what is the underlying phenomena behind this. This is a region chr9:139,396,896-139,397,109 which is coverred by Illumina MyeloidAmplicon kit primers. It seems that in the second pair of paired end sequencing there are those errors. It is wired since first member of pair seems OK. But second not. In many samples we sequenced. And the errors seems to be random. GATK realignment does not bring anything new.

I really don't know how to further track down underlying phenomena.

Vojtech.