Question: BWA aligning forward and reverse differently
gravatar for dorothea.emig
27 days ago by
dorothea.emig0 wrote:

Hi there,

I have amplicon data that I want to align with BWA - I set the softclipping penalty very high (-L 50) to force bwa to align end-to-end rather than soft-clipping in the middle of the read.

Now I have a read pair that is identical in R1 and R2, but I cannot make the aligner output the same alignments for both:



There should be two long deletions in the read - as shown in the first one. But in the second, it prefers to soft-clip no matter what I do. I have tried everything, raising the soft-clip penalty even higher, changing match, gap open, gap extension penalties. But no matter what I do, I cannot change the 38S to 22D2M20D36M.

I assume it has to do with the deletions being closer to the start of the second read, but shouldn't the high soft-clip penality force an alignment?

Any help how I can properly align from both directions would be very much appreciated!

bwa mem • 140 views
ADD COMMENTlink modified 27 days ago by h.mon31k • written 27 days ago by dorothea.emig0

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLYlink written 27 days ago by RamRS30k

If you have long deletions, bwa might not be the right tool.

ADD REPLYlink written 27 days ago by swbarnes28.6k

bwa is a short read aligner. Perhaps try minimap2 or another aligner.

ADD REPLYlink written 27 days ago by Istvan Albert ♦♦ 84k
gravatar for h.mon
27 days ago by
h.mon31k wrote:

These two reads are a pair, and they aren't identical - although they fully overlap, and one is the reverse-complement of the other. My guess is even with such high penalty, for the second read in the pair soft-clipping must have been cheaper than extending the mapping.

How can I be so certain these reads aren't identical and are a pair? First, see the reads names to confirm they are a pair (read names are identical). Then, check the SAM flags, for the first read the SAM flag is 99:

read paired (0x1)
read mapped in proper pair (0x2)
mate reverse strand (0x20)
first in pair (0x40)

For the second read, the sam flag is 147:

read paired (0x1)
read mapped in proper pair (0x2)
read reverse strand (0x10)
second in pair (0x80)

Thus, we know these reads are a pair, and they are properly mapped. The final piece of the puzzle comes from the SAM format specifications:

All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.

The second read, which is mapped to the reverse strand, has been stored reverse-complemented, so in the SAM SEQ field it has identical sequence to its pair, but the original sequence isn't identical.

ADD COMMENTlink written 27 days ago by h.mon31k

thanks, yes I am aware they are not identical but are an R1 R2 pair. What I meant is, they map to the same genomic sequence in forward and reverse direction. The reads contain 2 deletions which occur towards the end of R1, which means they would be at the beginning of R2. It looks like BWA doesn't like opening two large gaps towards the beginning of the read (hence softclips in R2), but will do it towards the end of R1. If I have done the math correctly, the overall alignment score of R1 is 44 (based on BWA's penalties) and for R2 with the softclipping it is 22 (since I set the soft-clipping penalty so high). My question is: why does BWA choose the softclipped alignment over the 2 deletions in R2, when the alignment score is better with 2 deletions.

ADD REPLYlink written 26 days ago by dorothea.emig0

I see what you mean, and I don't have an answer as to why bwa behaves like it does. My only suggestion would be to try a global aligner, e.g., BBMap or Bowtie2 with the --end-to-end flag.

I have blasted the read, and it maps to 2 different chromosomes in a RefSeq genome - the 72 initial bases match perfectly on one chromosome, and the last 34 bases match perfectly at a different chromosome. So I guess you have constructed a custom reference genome?

ADD REPLYlink written 26 days ago by h.mon31k

yes, I have a custom reference and it is supposed to map to two different locations. Unfortunately, I don't have a choice of aligner in this case and was trying to make it work. But thanks for the suggestions!

ADD REPLYlink written 26 days ago by dorothea.emig0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1817 users visited in the last hour