Question: BWA and Soft clipping
gravatar for christopher.patterson3
3.3 years ago by

Hello, I'm new to the forum here. I am working with a set of Amplicon sequences produced from Illumina (paired-end, read length 251 bp) and aligning using BWA (version 0.7.12), and then variant calling using GATK HaplotypeCaller. Unfortunately my VCF files have a bunch of INDELS that don't appear when I view the read alignments through IGV. All of these variants lie immediately adjacent to the site where BWA has soft-clipped the read. A lot of the reads have had over 200 of their 251 bp, frequently leaving ~20-30 bp for a given alignment, which makes me start to doubt whether the read could even be uniquely mapped to that locus.

On one sample, I ran BWA with a couple of different parameters:

  • Baseline: -M
  • P1: -M -L 251,251
  • P2: -M -L 251,251 -P
  • P3: -L 251,251 -P (with sorting and indexing through Samtools rather than picard)

I used -L to try to prevent all soft-clipping during the SW extension, and -P to skip SW on a proper pair. I wanted to check -M if it the variants were artifacts of excising shorter split reads. The vcf file for each alignment process produced the following number of variants:

  • Baseline: 177 variants, 3.31% reads failing HCMappingQualityFilter
  • P1: 173 variants, 6.18% reads failing HCMappingQualityFilter
  • P2: 33 variants, 22.69% reads failing HCMappingQualityFilter
  • P3: 36 variants, 22.69% reads failing HCMappingQualityFilter

Of those 31 variants were shared by all of the parameters, and 24 I could verify based on the alignment in IGV (20 SNPs, 4 INDELS). While those parameters did refine out many of the clear false positives, even with the -L and -P switches, there are still a lot of reads with soft-clipping.

I guess either I don't completely understand the purposes of the -L switches, which I assumed would stop all soft-clipping at the ends of reads, or why GATK-HC calls variants as deletions when they fall in regions that are soft-clipped (at least in respect to Illumina's amplicon sequencing).

sequencing alignment genome • 3.3k views
ADD COMMENTlink written 3.3 years ago by christopher.patterson30

Have you tried setting --dontUseSoftClippedBases to True? Default = False


ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Zaag770

Yes. Running HTC with that switch does dramatically improve the call set. It cuts it down to 26 variants, and actually one variant that wasn't in P2 or P3 vcfs, that I can see in IGV, but sits right at the edge of the clipped read, so it may just be a false positive. I also figured out how to view the soft-clipped read segments in IGV which helped with the confirmations.

ADD REPLYlink written 3.3 years ago by christopher.patterson30

Not sure why the -L switches don't work correctly on you - but you may want to run GATK's RealignerTargetCreator and IndelRealigner to get rid of false positive indels, as described in htslib's standard workflow here

ADD REPLYlink written 3.3 years ago by Philipp Bayer6.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 718 users visited in the last hour