Difference between Hard Clip and Soft Clip in Samtools CIGAR string
6.8 years ago

Hi all,

The documentation for Samtools is minimal at best. I'm still confused about the concept of a clipped read.

1. What is a clipped read? How is it different from a deletion?
2. What is a soft clip? If the sequence is present in the reference, how is it different from a mismatch?
3. What is a Hard clip?

Say I wanted to calculate base-pair coverage: would I include soft-clipped bases because they are present in the SEQ field?

If someone can provide an example such as

REF:    AGTCG GATCG GTACG

That would be even more awesome.

One last question:

I found a ' * ' as a CIGAR string, what does that mean?

Thanks

samtools hard clip soft clip clipped reads bam • 39k views

'*' as the CIGAR string means no alignment information is available, typically because the read is not aligned, so there is no way to show it relative to the reference.

6.8 years ago
brentp 23k

Hard-clipped (hard-masked) bases do not appear in the SEQ string; soft-clipped (soft-masked) bases do.

So, if your CIGAR is 10H10M10H, the SEQ will only be 10 bases long.

If your CIGAR is 10S10M10S, the SEQ and base qualities will both be 30 bases long.
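To check the two cases above yourself, here is a minimal Python sketch that works out how many bases SEQ should hold for a given CIGAR. It relies only on the rule that M, I, S, = and X consume read bases while H does not; the function names (`parse_cigar`, `expected_seq_length`, `clipped_bases`) are made up for this illustration and are not part of samtools.

```python
import re

# CIGAR operations that consume bases of the read (SEQ).
# H (hard clip) is deliberately absent: hard-clipped bases are not stored in SEQ.
QUERY_CONSUMING = set("MIS=X")

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; '*' yields an empty list."""
    if cigar == "*":
        return []
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings that contain anything other than <number><op> pairs.
    if "".join(n + op for n, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in tokens]


def expected_seq_length(cigar):
    """Number of bases the SEQ field should contain for this CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


def clipped_bases(cigar):
    """Return (soft_clipped, hard_clipped) base counts."""
    ops = parse_cigar(cigar)
    soft = sum(n for n, op in ops if op == "S")
    hard = sum(n for n, op in ops if op == "H")
    return soft, hard


if __name__ == "__main__":
    for cigar in ("10H10M10H", "10S10M10S"):
        print(cigar, expected_seq_length(cigar), clipped_bases(cigar))
```

The same helper can be pointed at any CIGAR from a real file to see whether the stored SEQ length matches what the clipping implies.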

In the case of soft-clipping, even though the clipped bases are present in SEQ, they are generally not used by variant callers and are usually not displayed when you view your data in a viewer. In either case, clipped bases should not be used in calculating coverage.
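As a rough illustration of the coverage point, the sketch below adds one alignment to a per-position counter while skipping clipped bases. It assumes POS is the 1-based position of the first aligned (non-clipped) base, as in SAM, and it makes one choice worth flagging: deleted (D) and skipped (N) reference positions are not counted as covered, which some tools do differently. `add_coverage` is a hypothetical helper, not a samtools function.

```python
import re
from collections import Counter

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def add_coverage(coverage, pos, cigar):
    """Add one alignment to `coverage` (a Counter keyed by 1-based reference position)."""
    ref_pos = pos
    for length, op in CIGAR_TOKEN.findall(cigar):
        length = int(length)
        if op in "M=X":
            # Aligned bases: these are the only ones counted as coverage here.
            for p in range(ref_pos, ref_pos + length):
                coverage[p] += 1
            ref_pos += length
        elif op in "DN":
            # Deletion or skipped region: move along the reference, count nothing.
            ref_pos += length
        # I, S, H and P do not advance the reference and add no coverage.
    return coverage


if __name__ == "__main__":
    cov = Counter()
    add_coverage(cov, 100, "10S10M10S")  # soft clips on both ends
    add_coverage(cov, 100, "10H10M10H")  # hard clips on both ends
    print(sorted(cov.items()))
```

Both example reads contribute coverage only at the ten aligned positions, regardless of whether the flanking bases were soft- or hard-clipped.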

Both kinds of clipping are different from deletions. Clipping simply means that part of the read cannot be aligned to the genome at this location (simplified, but a reasonable assumption for most cases, I think). A deletion means that a stretch of the genome is not present in the sample and therefore not in the reads.
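Since the question asked for an example on a small reference, here is a toy sketch that lays three reads out under the reference from the question (spaces removed). The reads, positions and CIGARs are invented purely for illustration, and `project` is a hypothetical helper. Deleted reference bases are drawn as '-', while clipped bases take up no reference column at all.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

# Reference from the question with the spaces removed (positions 1..15).
REF = "AGTCGGATCGGTACG"


def project(pos, cigar, seq):
    """Return (row, clipped): the read laid out under the reference, plus its soft-clipped bases."""
    aligned, clipped = [], []
    i = 0  # index into SEQ
    for length, op in CIGAR_TOKEN.findall(cigar):
        length = int(length)
        if op in "M=X":
            aligned.append(seq[i:i + length])
            i += length
        elif op in "DN":
            aligned.append("-" * length)  # reference bases with no read base
        elif op == "S":
            clipped.append(seq[i:i + length])  # present in SEQ, not aligned
            i += length
        elif op == "I":
            i += length  # inserted bases have no reference column
        # H and P: nothing in SEQ and nothing on the reference.
    return " " * (pos - 1) + "".join(aligned), "".join(clipped)


# Made-up reads: (label, POS, CIGAR, SEQ)
EXAMPLES = [
    ("soft clip", 4, "3S7M", "TTTCGGATCG"),
    ("hard clip", 4, "3H7M", "CGGATCG"),
    ("deletion", 1, "5M5D5M", "AGTCGGTACG"),
]

if __name__ == "__main__":
    print(f"{'REF':<10}{REF}")
    for label, pos, cigar, seq in EXAMPLES:
        row, clipped = project(pos, cigar, seq)
        print(f"{label:<10}{row}  CIGAR={cigar} clipped={clipped!r}")
```

The soft-clipped and hard-clipped reads occupy the same reference columns; the only difference is whether the three unaligned bases are kept in SEQ. The deletion read spans the whole reference, with a gap where the sample lacks bases.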

I'm not sure when H is used instead of the S and vice-versa. I would like to know that.


Thank you. I don't know why Samtools documentation is not as clear as your explanation.


The SAM specification document is on GitHub, so if you want to improve it, go ahead!

6.7 years ago
smithtomsean ▴ 190

My understanding of the choice between soft-clipping and hard-clipping is that hard-clipping is applied when the clipped bases align elsewhere in the reference genome, i.e. chimeric reads. At least in BWA this appears to be when hard clipping is used. I'm not sure about other aligners. From the BWA release notes:

Changed the way a chimeric alignment is reported (conforming to the upcoming
SAM spec v1.5). With 0.7.5, if the read has a chimeric alignment, the paired
or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100
bits. All the other hits part of the chimeric alignment will use hard
clipping and be marked with 0x800 if option "-M" is not in use, or marked
with 0x100 otherwise.

As an example, here's part of a BAM file with a read pair containing a chimeric read. The top hit is soft-clipped, and the second hit is hard-clipped and marked as secondary (0x100) because BWA was run with the -M option.

20692128    97    viral_genome    21417    60    69M32S    chr7    101141242    0    TACATCTTCTCCCTCTCTCACGACACAAGAATTAGTCACATAGGGATGTTCTCGTAAATCTACATTATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT    GGGGGGGGGGGGGGEGGEGGGGGGGGGFGGGGGGGGGGGGGEGFFGGGGGGGFGGFGGGGEGGGGGGGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@    NM:i:4    MD:Z:6G34G6C5C14    AS:i:49    XS:i:0    SA:Z:chr7,101141091,+,66S35M,60,0;
20692128    353    chr7    101141091    60    66H35M    =    101141242    252    ATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT    GGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@    NM:i:0    MD:Z:35    AS:i:35    XS:i:23    SA:Z:gi|224020395|ref|NC_001664.2|,21417,+,69M32S,60,4;
20692128    145    chr7    101141242    60    101M    gi|224020395|ref|NC_001664.2|    21417    0    GCAACAGAGCGAGACCCTATATTCATGAGTGTTGCAATGAGCCAAGTAGTGGAGGTTGGCTTTTGAAGGCAGAAAAGGACTGAGAAAAGCTAACACAGAGA    FEGCGGGGGCGEFCDEEEEGGGGGGGGGGGGGGGEGGGGGGFGGGEGGG
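To read the FLAG column of the records above (97, 353 and 145), a small decoder helps. This is a minimal sketch: the bit values are the standard SAM flag bits, but the short names in `FLAG_BITS` and the `decode_flag` helper are my own labels for illustration.

```python
# Standard SAM flag bits; the names are informal labels chosen for this sketch.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    for flag in (97, 353, 145):
        print(flag, decode_flag(flag))
```

Only the 66H35M record (flag 353) has the 0x100 bit set, which matches the release note: with -M, the hard-clipped part of a chimeric alignment is reported as secondary instead of supplementary (0x800).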