Question: bwa mapping to several similar strains, restrict maximum hits?
gravatar for Adrian Pelin
4.5 years ago by
Adrian Pelin2.2k
Adrian Pelin2.2k wrote:


I want to use bwa mem or aln, to align reads to 3-5 strains of a virus. However, if a read can map with equal quality/score to more than 1 strain, i want the read to be reported as unaligned. Basically, only reads that map to one strain only with a best score should be considered.

The genomes of the 5 strains are in a fasta file, each genome is present in one DNA molecule. The tricky part, is that vaccinia viruses have IR at the 3' and 5' ends of their genomes, so this is a repetitive regions within one strain.

What approach would you guys recommend? For sampe and samse I see the -n option can be set to 1 to accomplish what I need, but that might also mean that IR regions will not have mappings.


fasta bwa sam bwa-mem bwa-aln • 1.7k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 4.5 years ago by Adrian Pelin2.2k
gravatar for Brian Bushnell
4.4 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I think the easiest way to do this would be with BBSplit, which is specifically designed for handling ambiguity between and within organisms differently.  It maps to all references at once and outputs reads to one file per reference, only the one they map to best.  The "ambig" flag allows you to specify the behavior for reads that map ambiguously (discard, retain, randomly place, etc), while the "ambig2" flag allows you to specify the behavior for reads that specifically map ambiguously between references - in other words, it will not affect the behavior of a read that maps to only one organism, but matches two different locations within that one organism.


ADD COMMENTlink written 4.4 years ago by Brian Bushnell16k
gravatar for Matt Shirley
4.5 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

Allow secondary alignments, then sort your alignments by read name and toss any reads that map to multiple strains. That way you've directly answered your biological question of interest.

ADD COMMENTlink written 4.5 years ago by Matt Shirley8.9k

Sounds like a good approach. So with bwa mem it's just the -a parameter. Then I would have to work with the sam file produced. Any hint as how this sorting might happen? I am not really good with the sam format, still trying to learn.

ADD REPLYlink written 4.5 years ago by Adrian Pelin2.2k

You can sort by read name using samtools sort with the -n flag. Take a look at this answer: Can A Bam File Be Sorted By The Read Names

ADD REPLYlink written 4.5 years ago by Matt Shirley8.9k

Okay, I have some progress. I managed to do it and now I see what is going on. A lot of reads are mapping to more than one reference with a perfect match, CIGAR string is only M. But now I have a few problems:

1) There is no score reported in the sam file. When I say score, i mean how well the read mapped. So if it matches perfectly to 3 references and the 4th has only one mismatch, I need to look at the CIGAR string. Is there any way to sort how well the read mapped to the different references? Or is that done already?

2) Because there is no score, there is no way for me to remove reads that match with identical scores to more than one reference.

Am I missing something here?

ADD REPLYlink written 4.5 years ago by Adrian Pelin2.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2030 users visited in the last hour