Question: Difference between Hard Clip and Soft Clip in Samtools CIGAR string
The documentation for Samtools is minimal at best. I'm still confused about the concept of a clipped read.

  1. What is a clipped read? How is it different from a deletion? 
  2. What is a Soft clip? If the sequence is present in the reference is it different from a mismatch? 
  3. What is a Hard clip?


If I wanted to calculate base-pair coverage, would I include soft-clipped bases because they are present in the SEQ field?

Read:   AGTCG xxxCG  GTACG

I also found '*' as a CIGAR string. What does that mean?



'*' as the CIGAR string means that no alignment is available, typically because the read is unmapped, so there is no way to show it relative to the reference.

brentp wrote:

Hard-masked (hard-clipped) bases do not appear in the SEQ string; soft-masked (soft-clipped) bases do.

So, if your CIGAR is `10H10M10H`, then the SEQ will only be 10 bases long.

If your CIGAR is `10S10M10S`, then the SEQ and base qualities will be 30 bases long.
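
To make the two cases above concrete, here is a minimal Python sketch that computes the expected SEQ length from a CIGAR string. The helper names `parse_cigar` and `seq_length` are made up for illustration; the only thing taken from the SAM spec is which operations consume read bases (M, I, S, =, X) and that H does not.

```python
import re

# Operations whose bases are stored in SEQ (per the SAM spec).
# H (hard clip) is deliberately absent: hard-clipped bases are not stored.
SEQ_CONSUMING_OPS = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs. Hypothetical helper."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def seq_length(cigar):
    """Number of bases the SEQ field should contain for this CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in SEQ_CONSUMING_OPS)

print(seq_length("10H10M10H"))  # 10
print(seq_length("10S10M10S"))  # 30
```
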

In the case of soft-masking, even though the bases are present in SEQ, they are generally not used by variant callers and not displayed by default when you view your data in a viewer. In either case, masked bases should not be used in calculating coverage.


Both kinds of masking are different from deletions. Masking simply means that part of the read could not be aligned to the genome at that location (simplified, but a reasonable assumption for most cases, I think). A deletion means that a stretch of the genome is absent from the sample and therefore from the reads.
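
As a rough sketch of how clipping and deletions differ when counting coverage, the Python below walks a CIGAR string and collects the reference positions covered by aligned read bases. The function name `covered_positions` and the example read are hypothetical. It also assumes that deleted (D) and skipped (N) reference bases are not counted as covered, which is a convention choice and not something the spec dictates.

```python
import re

ALIGNED_OPS = set("M=X")      # read bases aligned to the reference
SKIPPED_REF_OPS = set("DN")   # reference bases with no read base

def covered_positions(pos, cigar):
    """Return 1-based reference positions covered by aligned read bases.

    Assumption: D and N advance along the reference but are not counted.
    """
    covered = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in ALIGNED_OPS:
            covered.extend(range(ref, ref + length))
            ref += length
        elif op in SKIPPED_REF_OPS:
            ref += length
        # S, H, I and P do not advance along the reference,
        # so clipped bases never contribute to coverage here.
    return covered

positions = covered_positions(100, "5S10M2D5M5H")
print(len(positions), positions[0], positions[-1])  # 15 100 116
```

Note that POS refers to the first aligned base, not the first clipped base, so the leading `5S` does not shift the start.
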

I'm not sure when H is used instead of the S and vice-versa. I would like to know that.

Hi brentp. My understanding of the choice between soft-clipping and hard-clipping is that hard-clipping is applied when the clipped bases align elsewhere in the reference genome, i.e., in chimeric reads. At least in bwa, this appears to be when hard clipping is used. I'm not sure about other aligners.

From the bwa-mem 0.7.5 release notes:

"Changed the way a chimeric alignment is reported (conforming to the upcoming
SAM spec v1.5). With 0.7.5, if the read has a chimeric alignment, the paired
or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100
bits. All the other hits part of the chimeric alignment will use hard
clipping and be marked with 0x800 if option "-M" is not in use, or marked
with 0x100 otherwise."

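As an example, here's a record from a BAM file for a read pair containing a chimeric read. The top hit is soft-clipped (it appears in the SA tag as 69M32S). The second hit, shown below, is hard-clipped (66H35M) and marked as secondary (bit 0x100 is set in flag 353) because BWA was run with the -M option.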


20692128    353    chr7    101141091    60    66H35M    =    101141242    252    ATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT    GGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@    NM:i:0    MD:Z:35    AS:i:35    XS:i:23    SA:Z:gi|224020395|ref|NC_001664.2|,21417,+,69M32S,60,4;
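
For anyone who wants to check the record above by hand, here is a small Python sketch that decodes the flag and the SA tag. The names `decode_flag` and `parse_sa` are made up for this example. The bit meanings and the SA field order (rname, pos, strand, CIGAR, mapQ, NM) follow the SAM spec as I understand it.

```python
# SAM flag bits and short descriptive labels (labels are my own wording).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """List the labels of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_sa(value):
    """Parse the value of an SA:Z: tag into one dict per alignment."""
    hits = []
    for entry in value.strip(";").split(";"):
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        hits.append({"rname": rname, "pos": int(pos), "strand": strand,
                     "cigar": cigar, "mapq": int(mapq), "nm": int(nm)})
    return hits

print(decode_flag(353))
# ['paired', 'mate_reverse', 'first_in_pair', 'secondary']

sa = parse_sa("gi|224020395|ref|NC_001664.2|,21417,+,69M32S,60,4;")
print(sa[0]["cigar"])  # 69M32S
```

The 0x100 bit is set and 0x800 is not, which matches the "-M" behaviour described in the release notes. Both CIGARs also account for the same 101-base read: 66 + 35 and 69 + 32.
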


Thank you. I don't know why Samtools documentation is not as clear as your explanation. 

The SAM specification document is on GitHub, so if you want to improve it, go ahead!

