BWA aligning forward and reverse differently
1
0
Entering edit mode
3.7 years ago

Hi there,

I have amplicon data that I want to align with BWA - I set the softclipping penalty very high (-L 50) to force bwa to align end-to-end rather than soft-clipping in the middle of the read.

Now I have a read pair that is identical in R1 and R2, but I cannot make the aligner output the same alignments for both:

M06706:14:000000000-CN3FL:1:1110:20686:16364 99 xxx   12      43      72M22D2M20D36M  =       12      72      ATGATGGGCTCTTCCCCCGCGTGCAGCAGTGGAGCCACCAGCAGCGGGTGGGCGACCTCTTCCAGAAGCTGGCAATTCGGCCATCACGGCTCTCCTCCAGTGGGACTCCC            

M06706:14:000000000-CN3FL:1:1110:20686:16364 147        xxx   12      45      72M38S  =       12      -72     ATGATGGGCTCTTCCCCCGCGTGCAGCAGTGGAGCCACCAGCAGCGGGTGGGCGACCTCTTCCAGAAGCTGGCAATTCGGCCATCACGGCTCTCCTCCAGTGGGACTCCC

There should be two long deletions in the read - as shown in the first one. But in the second, it prefers to soft-clip no matter what I do. I have tried everything, raising the soft-clip penalty even higher, changing match, gap open, gap extension penalties. But no matter what I do, I cannot change the 38S to 22D2M20D36M.

I assume it has to do with the deletions being closer to the start of the second read, but shouldn't the high soft-clip penality force an alignment?

Any help how I can properly align from both directions would be very much appreciated!

bwa mem • 1.9k views
ADD COMMENT
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY
0
Entering edit mode

If you have long deletions, bwa might not be the right tool.

ADD REPLY
0
Entering edit mode

bwa is a short read aligner. Perhaps try minimap2 or another aligner.

ADD REPLY
0
Entering edit mode
3.7 years ago
h.mon 35k

These two reads are a pair, and they aren't identical - although they fully overlap, and one is the reverse-complement of the other. My guess is even with such high penalty, for the second read in the pair soft-clipping must have been cheaper than extending the mapping.

How can I be so certain these reads aren't identical and are a pair? First, see the reads names to confirm they are a pair (read names are identical). Then, check the SAM flags, for the first read the SAM flag is 99:

read paired (0x1)
read mapped in proper pair (0x2)
mate reverse strand (0x20)
first in pair (0x40)

For the second read, the sam flag is 147:

read paired (0x1)
read mapped in proper pair (0x2)
read reverse strand (0x10)
second in pair (0x80)

Thus, we know these reads are a pair, and they are properly mapped. The final piece of the puzzle comes from the SAM format specifications:

All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.

The second read, which is mapped to the reverse strand, has been stored reverse-complemented, so in the SAM SEQ field it has identical sequence to its pair, but the original sequence isn't identical.

ADD COMMENT
0
Entering edit mode

thanks, yes I am aware they are not identical but are an R1 R2 pair. What I meant is, they map to the same genomic sequence in forward and reverse direction. The reads contain 2 deletions which occur towards the end of R1, which means they would be at the beginning of R2. It looks like BWA doesn't like opening two large gaps towards the beginning of the read (hence softclips in R2), but will do it towards the end of R1. If I have done the math correctly, the overall alignment score of R1 is 44 (based on BWA's penalties) and for R2 with the softclipping it is 22 (since I set the soft-clipping penalty so high). My question is: why does BWA choose the softclipped alignment over the 2 deletions in R2, when the alignment score is better with 2 deletions.

ADD REPLY
0
Entering edit mode

I see what you mean, and I don't have an answer as to why bwa behaves like it does. My only suggestion would be to try a global aligner, e.g., BBMap or Bowtie2 with the --end-to-end flag.

I have blasted the read, and it maps to 2 different chromosomes in a RefSeq genome - the 72 initial bases match perfectly on one chromosome, and the last 34 bases match perfectly at a different chromosome. So I guess you have constructed a custom reference genome?

ADD REPLY
0
Entering edit mode

yes, I have a custom reference and it is supposed to map to two different locations. Unfortunately, I don't have a choice of aligner in this case and was trying to make it work. But thanks for the suggestions!

ADD REPLY

Login before adding your answer.

Traffic: 2230 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6