Question

BWA and Soft clipping

0

Entering edit mode

7.5 years ago

christopher.patterson3 • 0

Hello, I'm new to the forum here. I am working with a set of Amplicon sequences produced from Illumina (paired-end, read length 251 bp) and aligning using BWA (version 0.7.12), and then variant calling using GATK HaplotypeCaller. Unfortunately my VCF files have a bunch of INDELS that don't appear when I view the read alignments through IGV. All of these variants lie immediately adjacent to the site where BWA has soft-clipped the read. A lot of the reads have had over 200 of their 251 bp, frequently leaving ~20-30 bp for a given alignment, which makes me start to doubt whether the read could even be uniquely mapped to that locus.

On one sample, I ran BWA with a couple of different parameters:

Baseline: -M
P1: -M -L 251,251
P2: -M -L 251,251 -P
P3: -L 251,251 -P (with sorting and indexing through Samtools rather than picard)

I used -L to try to prevent all soft-clipping during the SW extension, and -P to skip SW on a proper pair. I wanted to check -M if it the variants were artifacts of excising shorter split reads. The vcf file for each alignment process produced the following number of variants:

Baseline: 177 variants, 3.31% reads failing HCMappingQualityFilter
P1: 173 variants, 6.18% reads failing HCMappingQualityFilter
P2: 33 variants, 22.69% reads failing HCMappingQualityFilter
P3: 36 variants, 22.69% reads failing HCMappingQualityFilter

Of those 31 variants were shared by all of the parameters, and 24 I could verify based on the alignment in IGV (20 SNPs, 4 INDELS). While those parameters did refine out many of the clear false positives, even with the -L and -P switches, there are still a lot of reads with soft-clipping.

I guess either I don't completely understand the purposes of the -L switches, which I assumed would stop all soft-clipping at the ends of reads, or why GATK-HC calls variants as deletions when they fall in regions that are soft-clipped (at least in respect to Illumina's amplicon sequencing).

alignment genome sequencing • 7.4k views

ADD COMMENT • link 7.5 years ago by christopher.patterson3 • 0

3

Entering edit mode

Have you tried setting --dontUseSoftClippedBases to True? Default = False

(HaplotypeCaller.)

ADD REPLY • link 7.5 years ago by Zaag ▴ 860

0

Entering edit mode

Yes. Running HTC with that switch does dramatically improve the call set. It cuts it down to 26 variants, and actually one variant that wasn't in P2 or P3 vcfs, that I can see in IGV, but sits right at the edge of the clipped read, so it may just be a false positive. I also figured out how to view the soft-clipped read segments in IGV which helped with the confirmations.

ADD REPLY • link 7.5 years ago by christopher.patterson3 • 0

0

Entering edit mode

Not sure why the -L switches don't work correctly on you - but you may want to run GATK's RealignerTargetCreator and IndelRealigner to get rid of false positive indels, as described in htslib's standard workflow here

ADD REPLY • link 7.5 years ago by Philipp Bayer 8.3k