Question: Difference between Hard Clip and Soft Clip in Samtools CIGAR string
QVINTVS_FABIVS_MAXIMVS (USA SoCal) wrote, 4.9 years ago:

Hi all, 

The documentation for Samtools is minimal at best. I'm still confused about the concept of a clipped read.

  1. What is a clipped read? How is it different from a deletion? 
  2. What is a soft clip? If the sequence is present in the reference, how is it different from a mismatch? 
  3. What is a hard clip?

 

Say I wanted to calculate base-pair coverage: would I include soft-clipped bases because they are present in the <seq>? 

If someone can provide an example such as

REF:    AGTCG GATCG GTACG

Read:   AGTCG xxxCG  GTACG

That would be even more awesome.

One last question:

I found a '*' as a CIGAR string. What does that mean? 

 

Thanks


'*' as the CIGAR string means no alignment is available, typically because the read is unmapped, so there is no way to show it relative to the reference.

written 4.9 years ago by brentp
brentp (Salt Lake City, UT) wrote, 4.9 years ago:

Hard-clipped bases do not appear in the SEQ string; soft-clipped bases do.

So, if your CIGAR is `10H10M10H`, then the SEQ will only be 10 bases long.

If your CIGAR is `10S10M10S`, then the SEQ and base qualities will be 30 bases long.
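The two cases above can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the set of operations that consume read bases (M, I, S, =, X) is taken from the SAM specification.

```python
import re

# CIGAR operations that consume bases of the read (SEQ), per the SAM specification.
# H (hard clip), D, N and P do not appear in SEQ.
QUERY_CONSUMING = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string such as '10S10M10S' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def expected_seq_length(cigar):
    """Number of bases that should be present in SEQ for this CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)

print(expected_seq_length("10H10M10H"))  # hard clips are absent from SEQ
print(expected_seq_length("10S10M10S"))  # soft clips are kept in SEQ
```

The first call gives 10 and the second gives 30, matching the lengths described above.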

In the case of soft clipping, even though the clipped bases are present in SEQ, they are not part of the alignment: variant callers generally ignore them, and genome viewers typically hide them by default. In either case, clipped bases should not be counted when calculating coverage.
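As a rough illustration of coverage that leaves clipped bases out, here is a sketch that counts depth per reference position from (POS, CIGAR) pairs. The two reads are invented for the example, the function names are hypothetical, and not counting deleted positions (D/N) as covered is an assumption; some tools make a different choice there.

```python
import re
from collections import Counter

ALIGNED_OPS = set("M=X")      # a read base sits on a reference base
REF_SKIPPING_OPS = set("DN")  # move along the reference without a read base

def covered_positions(pos, cigar):
    """Yield the 1-based reference positions covered by aligned bases of one read."""
    ref_pos = pos  # POS is the position of the first aligned base, not of any clipped base
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in ALIGNED_OPS:
            yield from range(ref_pos, ref_pos + length)
            ref_pos += length
        elif op in REF_SKIPPING_OPS:
            ref_pos += length  # assumption: deleted/skipped positions are not counted
        # S, H, I and P do not advance along the reference,
        # so clipped bases never contribute to coverage here.

def coverage(reads):
    """Per-position depth for an iterable of (pos, cigar) pairs."""
    depth = Counter()
    for pos, cigar in reads:
        depth.update(covered_positions(pos, cigar))
    return depth

# Two made-up reads: one with a leading soft clip, one with a trailing hard clip.
reads = [(100, "5S10M"), (105, "10M5H")]
depth = coverage(reads)
print(depth[99], depth[100], depth[105], depth[114], depth[115])
```

The first read covers positions 100 to 109 and the second 105 to 114, so only the overlap has depth 2 and nothing outside 100 to 114 is counted.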

 

Both kinds of clipping are different from deletions. Clipping simply means that part of the read could not be aligned to the genome at this location (simplified, but a reasonable assumption for most cases, I think). A deletion means that a stretch of the genome is absent from the sample and therefore from the reads, while the read bases on either side of it still align. 
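One way to see the difference is to count how many read bases and how many reference bases each CIGAR accounts for. The sketch below assumes the xxx in the question's example stands for three reference bases missing from the read, which would give a CIGAR like 5M3D7M; that reading is an assumption, and the function name is hypothetical.

```python
import re

# Which operations consume read bases and reference bases, per the SAM specification.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")

def cigar_lengths(cigar):
    """Return (bases present in SEQ, bases spanned on the reference)."""
    query = reference = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in CONSUMES_QUERY:
            query += int(length)
        if op in CONSUMES_REFERENCE:
            reference += int(length)
    return query, reference

for cigar in ("5M3D7M", "5S10M", "5H10M"):
    print(cigar, cigar_lengths(cigar))
```

The deletion (5M3D7M) spans 15 reference bases with only 12 read bases. The soft clip (5S10M) keeps all 15 read bases in SEQ but spans only 10 reference bases, and the hard clip (5H10M) has 10 of each because the clipped bases are dropped from SEQ.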

I'm not sure when H is used instead of S and vice versa. I would like to know that.


Hi brentp. My understanding of the choice between soft clipping and hard clipping is that hard clipping is applied when the clipped bases align elsewhere in the reference genome, i.e. in chimeric reads. At least in bwa, this appears to be when hard clipping is used. I'm not sure about other aligners.

bwa-mem 0.7.5 release notes from http://seqanswers.com/forums/showthread.php?t=31237:

"Changed the way a chimeric alignment is reported (conforming to the upcoming
SAM spec v1.5). With 0.7.5, if the read has a chimeric alignment, the paired
or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100
bits. All the other hits part of the chimeric alignment will use hard
clipping and be marked with 0x800 if option "-M" is not in use, or marked
with 0x100 otherwise."

As an example, here's part of a BAM file with a read pair containing a chimeric read. The top hit is soft clipped, and the second hit is hard clipped and marked as secondary by BWA (-M option).

20692128    97    viral_genome    21417    60    69M32S    chr7    101141242    0    TACATCTTCTCCCTCTCTCACGACACAAGAATTAGTCACATAGGGATGTTCTCGTAAATCTACATTATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT    GGGGGGGGGGGGGGEGGEGGGGGGGGGFGGGGGGGGGGGGGEGFFGGGGGGGFGGFGGGGEGGGGGGGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@    NM:i:4    MD:Z:6G34G6C5C14    AS:i:49    XS:i:0    SA:Z:chr7,101141091,+,66S35M,60,0;


20692128    353    chr7    101141091    60    66H35M    =    101141242    252    ATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT    GGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@    NM:i:0    MD:Z:35    AS:i:35    XS:i:23    SA:Z:gi|224020395|ref|NC_001664.2|,21417,+,69M32S,60,4;


20692128    145    chr7    101141242    60    101M    gi|224020395|ref|NC_001664.2|    21417    0    GCAACAGAGCGAGACCCTATATTCATGAGTGTTGCAATGAGCCAAGTAGTGGAGGTTGGCTTTTGAAGGCAGAAAAGGACTGAGAAAAGCTAACACAGAGA    FEGCGGGGGCGEFCDEEEEGGGGGGGGGGGGGGGEGGGGGGFGGGEGGG
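The FLAG values in the records above (97, 353 and 145) can be decoded to see which record carries the secondary bit. This is a small sketch: the bit values come from the SAM specification, while the label strings and the function name are just chosen for the example.

```python
# SAM FLAG bits (values from the SAM specification; the label strings are informal).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

for flag in (97, 353, 145):
    print(flag, decode_flag(flag))
```

Only 353 (the hard-clipped 66H35M record) has the 0x100 bit set, which fits the release note's description of the "-M" behaviour. Without "-M", the note says that record would carry 0x800 instead.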

written 4.8 years ago by smithtomsean

Thank you. I don't know why Samtools documentation is not as clear as your explanation. 

written 4.9 years ago by QVINTVS_FABIVS_MAXIMVS

The SAM specification document is on GitHub, so if you want to improve it, go ahead!

written 4.9 years ago by Matt Shirley