Hello, I'm new to the forum here. I am working with a set of Amplicon sequences produced from Illumina (paired-end, read length 251 bp) and aligning using BWA (version 0.7.12), and then variant calling using GATK HaplotypeCaller. Unfortunately my VCF files have a bunch of INDELS that don't appear when I view the read alignments through IGV. All of these variants lie immediately adjacent to the site where BWA has soft-clipped the read. A lot of the reads have had over 200 of their 251 bp, frequently leaving ~20-30 bp for a given alignment, which makes me start to doubt whether the read could even be uniquely mapped to that locus.
On one sample, I ran BWA with a couple of different parameters:
- Baseline: -M
- P1: -M -L 251,251
- P2: -M -L 251,251 -P
- P3: -L 251,251 -P (with sorting and indexing through Samtools rather than picard)
I used -L to try to prevent all soft-clipping during the SW extension, and -P to skip SW on a proper pair. I wanted to check -M if it the variants were artifacts of excising shorter split reads. The vcf file for each alignment process produced the following number of variants:
- Baseline: 177 variants, 3.31% reads failing HCMappingQualityFilter
- P1: 173 variants, 6.18% reads failing HCMappingQualityFilter
- P2: 33 variants, 22.69% reads failing HCMappingQualityFilter
- P3: 36 variants, 22.69% reads failing HCMappingQualityFilter
Of those 31 variants were shared by all of the parameters, and 24 I could verify based on the alignment in IGV (20 SNPs, 4 INDELS). While those parameters did refine out many of the clear false positives, even with the -L and -P switches, there are still a lot of reads with soft-clipping.
I guess either I don't completely understand the purposes of the -L switches, which I assumed would stop all soft-clipping at the ends of reads, or why GATK-HC calls variants as deletions when they fall in regions that are soft-clipped (at least in respect to Illumina's amplicon sequencing).