Why does a soft clip happen at the last base?
6.5 years ago
chongchu.cs ▴ 10

I am using BWA "mem" (with default settings) to align paired-end Illumina reads (read length 150 bp). In some alignments, the CIGAR field is reported as "149M1S" (with a FLAG of 163). Why is the last base reported as soft-clipped?
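For readers following along, here is a minimal sketch of how the two fields mentioned above can be decoded. The bit names follow the FLAG table in the SAM specification; the function names (`decode_flag`, `parse_cigar`, `query_length`) are made up for this illustration and are not part of BWA or samtools.

```python
import re

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def query_length(cigar):
    """Number of read bases covered by the CIGAR (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(decode_flag(163))
print(parse_cigar("149M1S"))
print(query_length("149M1S"))
```

So flag 163 describes a properly paired second-in-pair read on the forward strand whose mate is on the reverse strand, and "149M1S" accounts for all 150 bases of the read.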

From my understanding of the scoring scheme, by default a soft clip has a penalty of 5 and a mismatch has a penalty of 4, so reporting the last base as a mismatch should give a higher score. Why is it reported as a soft clip instead?
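The arithmetic behind this reasoning can be sketched as follows. The bwa manual describes the `-L` clipping penalty roughly as: if the best score reaching the end of the query is larger than the best Smith-Waterman score minus the clipping penalty, clipping is not applied. The code below is only a simplified, ungapped model of that sentence using the default `-A 1`, `-B 4`, `-L 5`; the function names are hypothetical and this is not BWA's actual implementation.

```python
MATCH = 1      # bwa mem default for -A
MISMATCH = 4   # bwa mem default for -B
CLIP = 5       # bwa mem default for -L


def ungapped_score(n_match, n_mismatch):
    """Score of an alignment with no indels."""
    return n_match * MATCH - n_mismatch * MISMATCH


def keeps_end_to_end(best_local, score_to_end, clip_penalty=CLIP):
    """Simplified reading of the -L rule: no clipping if the score
    reaching the end of the read beats the best local score minus
    the clipping penalty."""
    return score_to_end > best_local - clip_penalty


# 150 bp read, 149 matching bases followed by one mismatch.
best_local = ungapped_score(149, 0)      # stop before the mismatch
score_to_end = ungapped_score(149, 1)    # extend through the mismatch
print(best_local, score_to_end, keeps_end_to_end(best_local, score_to_end))
```

Under this simplified model the extended alignment scores 145 against a threshold of 144, so a lone mismatch in the final base would stay as "150M", which is the expectation stated in the question. The sketch ignores gaps, ambiguous bases, and reference boundaries, any of which could change the outcome in a real run.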

Thank you.

alignment next-gen sequencing BWA
6.5 years ago

There is no direct relation between the fact that a read is soft-clipped and its score.

It's clipped because the last base of this read doesn't match the reference genome, and an alignment (CIGAR operators M/=/X) should not start or end with a mismatch.

Thank you very much for the reply, but I am still confused. Say a read is 150 bp long and the alignment CIGAR is "44S107M". My understanding is that the "44S" happens because there are more than 8 mismatches (assuming no indels) within those 44 bp, so the penalty for reporting a soft clip is smaller than the penalty for reporting "44M" (with 8 mismatches), and the segment is therefore reported as "44S". If this is not the case, what causes the soft clip?

I don't know about the score. Look at the SAM record itself for this 44S107M: I'm pretty sure you'll find an SA tag containing alternate alignments for the 44S section at a 'better' location (~ 44M107S).
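A minimal sketch of how the SA tag could be inspected, assuming the layout given in the SAM optional-fields specification (`rname,pos,strand,CIGAR,mapQ,NM;` repeated once per additional alignment). The tag value in the example is invented for illustration, and `parse_sa` and `SuppAln` are hypothetical names, not an existing API.

```python
from typing import List, NamedTuple


class SuppAln(NamedTuple):
    rname: str
    pos: int
    strand: str
    cigar: str
    mapq: int
    nm: int


def parse_sa(tag: str) -> List[SuppAln]:
    """Parse an SA tag (with or without the 'SA:Z:' prefix) into records."""
    value = tag[len("SA:Z:"):] if tag.startswith("SA:Z:") else tag
    records = []
    for entry in value.split(";"):
        if not entry:  # the tag ends with a trailing ';'
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(SuppAln(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return records


# Made-up example value: the clipped 44 bases align elsewhere as 44M107S.
for aln in parse_sa("SA:Z:chr2,1000,-,44M107S,60,0;"):
    print(aln.rname, aln.pos, aln.strand, aln.cigar)
```

If the SA entry's CIGAR has its M block where the primary record has its S block, the two records together describe one read split across two locations.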

Yes, I found the SA tag. But why is that a better location? And back to my original question: the 1S happens at the end of the alignment. Why not directly report "150M"?

> Why not directly report "150M"?

Because an alignment (150M => 149=1X) should not end with a mismatch.

I don't get the point: why should an alignment not end with a mismatch? In the CIGAR field, "M" can be either a match or a mismatch, right? The NM tag (edit distance) will record how many mismatches there are.
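To make the "M" versus "=/X" distinction concrete, here is a small sketch that expands an ungapped "M" block into "=" and "X" runs by comparing the read with the reference. It assumes no indels and equal-length sequences; `extended_cigar` is a hypothetical helper and the sequences are synthetic.

```python
from itertools import groupby


def extended_cigar(read, ref):
    """Rewrite an ungapped alignment as runs of '=' (match) and 'X' (mismatch)."""
    if len(read) != len(ref):
        raise ValueError("ungapped comparison needs equal-length sequences")
    ops = ("=" if r == g else "X" for r, g in zip(read.upper(), ref.upper()))
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


# Synthetic 150 bp reference window; the read differs only at the last base.
ref = "ACGT" * 37 + "AC"
read = ref[:-1] + "G"
print(extended_cigar(read, ref))
```

Both "150M" and "149=1X" describe the same end-to-end alignment here; the "M" form simply does not say where the single mismatch is.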

> But why is that a better location?

This is a split read. Maybe it's an inversion, a large deletion, a translocation, etc. The correct way to put it is: it seems that one part of this read aligns here and another part matches elsewhere.
