Soft-Clipped Vs Unmapped?
10.9 years ago
Bioscientist ★ 1.7k

To my eyes, the two look quite similar. Say we have a 100bp read, 50bp of which cannot be mapped while the other 50bp can. How would BWA categorize this read? Would BWA call it an "unmapped" read because 50bp cannot be mapped, or would it be "mapped" but with 50bp of "soft-clipped" sequence?

Or does BWA have a scoring system for mapping that sets a threshold for distinguishing the two?

thx

edit: maybe this is related to "centeredness"? Say the breakpoint is located at 99:1; then the 99bp would be mapped, with 1bp reported as "soft-clipped" sequence. But for 50:50, BWA might regard the read as "unmapped".

bwa • 11k views
10.3 years ago
harremsis ▴ 30

I'm not an expert on read mapping and am still trying to get to grips with it myself. But in my experience there are cases in which BWA reports extensively soft-clipped reads as mapped. Here's an example from a paired-end Illumina sequencing project:

CTCAG_6_1205_14418_171577_2     163     gi|261748867|gb|CM000804.1|     25090342        17      61S20M  =       25090377        116     TGCAGCCCCGCTTTGGTGAAAAAACAAGATAGGAACTGTTGTTGTTCAACTGTACTGTCACCTGCAGCACACACAACCTCC       bbbeeeeegggggiiighhiiiiiiiiiiihiifhiiiiiihiihhhihihihiiiggggggeeeeedddcdccccccccc       RG:Z:FCC0ACBACXX_L6_4   XT:A:M  NM:i:0  SM:i:17 AM:i:17 XM:i:0  XO:i:0  XG:i:0  MD:Z:20


As you can see from the CIGAR string 61S20M, 61bp have been soft-clipped from the beginning of the read. The flag 163 (=128+32+2+1) indicates that the read was mapped (the 0x4 "unmapped" bit is not set), paired, mapped in a proper pair, second in pair, and that its mate mapped to the reverse strand (check out this great site for decoding SAM bit flags).
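If you want to check the flag arithmetic yourself, here is a minimal Python sketch that splits a SAM flag into its individual bits. The bit values are the ones defined in the SAM specification; the function name `decode_flag` and the label strings are my own and only for illustration.

```python
# Bit values as defined in the SAM specification.
# The label strings are illustrative names, not official identifiers.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']
```

Note that "unmapped" is absent from the list for 163, which is why the read counts as mapped despite the heavy soft-clipping.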

So it seems that even with >50% soft-clipping BWA reports reads as mapped. So far I could not figure out how to tell BWA not to do that...which I would actually prefer.
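One possible workaround is to filter after mapping rather than inside BWA. Below is a rough Python sketch that computes the soft-clipped fraction of a read from its CIGAR string (6th SAM field). The name `soft_clip_fraction` is hypothetical and any cutoff you apply to its result (say 0.5) is an arbitrary choice of yours, not something BWA provides.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the read sequence stored in the SAM record.
QUERY_OPS = set("MIS=X")


def soft_clip_fraction(cigar):
    """Fraction of the stored read sequence that is soft-clipped (S)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    read_len = sum(n for n, op in ops if op in QUERY_OPS)
    clipped = sum(n for n, op in ops if op == "S")
    # A CIGAR of "*" (no alignment) yields no operations; report 0.0 then.
    return clipped / read_len if read_len else 0.0


frac = soft_clip_fraction("61S20M")
print(round(frac, 3))  # 0.753
```

For the read above, 61 of 81 bases are clipped, so a cutoff of 0.5 on this fraction would discard it.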


The mapping quality (5th field) is only 17, which equates to a probability of 10^(-17/10) ≈ 0.01995262, i.e. roughly a 2% chance that the mapping is incorrect. That is quite high when you are mapping millions of reads.
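For reference, MAPQ is Phred-scaled: the SAM specification defines it as -10·log10 of the probability that the mapping position is wrong, rounded to the nearest integer. A quick Python check of the conversion (the function name is just for illustration):

```python
def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality to an error probability."""
    return 10 ** (-mapq / 10)


p = mapq_to_error_prob(17)
print(f"{p:.8f} ({p:.2%})")  # 0.01995262 (2.00%)
```

So across a million reads at MAPQ 17, you would expect on the order of 20,000 to be misplaced.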

10.9 years ago

As I understand the terminology, it will be "mapped" but with 50bp of "soft-clipped" sequence. Unmapped reads have no part of their sequence aligned to the reference.


I'm just curious how BWA works. Can a read still be considered "mapped" even when half of its length cannot be mapped?