I'm working with Illumina MiSeq reads and I'm having some trouble with variant calling. I used cutadapt to trim adapters, bwa for alignment, and GATK HaplotypeCaller (with -dontUseSoftClippedBases) for variant calling. I also used vcfx (http://www.castelli-lab.net/apps/apps_vcfx.php) to check the calls more closely. Using IGV, I looked at the positions that vcfx marked as interrogated and saw many reads with soft-clipped bases.
I read about soft- and hard-clipped bases and I think I understand the concept, but it's not clear to me WHAT these sequences actually are. Part of each read matches the genome (with great base and mapping quality), but the soft-clipped parts match neither the genome nor the adapters (these bases also have Phred scores >30, so quality trimming doesn't help). I did find some sequences, like CGTGTCGCTGGTGCGGTCT, that show up in many reads. I BLASTed it and it matched bacteria, but not PhiX...
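In case it helps anyone reproduce this, here is roughly how the soft-clipped parts can be pulled out of the alignments and tallied. It is only a minimal sketch: it assumes plain-text SAM records as input, and the function names and the `min_len` cutoff are made up for illustration, not part of any of the tools above.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_segments(cigar, seq):
    """Return (left_clip, right_clip) taken from SEQ for one alignment."""
    # Hard-clipped bases (H) are absent from SEQ, so ignore those operations.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    left = right = ""
    if ops and ops[0][1] == "S":
        left = seq[:ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S":
        right = seq[-ops[-1][0]:]
    return left, right

def count_clips(sam_lines, min_len=10):
    """Count identical soft-clipped sequences of at least min_len bases."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        cigar, seq = fields[5], fields[9]
        if cigar == "*" or seq == "*":  # unmapped or sequence not stored
            continue
        for clip in soft_clipped_segments(cigar, seq):
            if len(clip) >= min_len:
                counts[clip] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Reads SAM text from standard input and prints the 20 most common clips.
    for clip, n in count_clips(sys.stdin).most_common(20):
        print(n, clip, sep="\t")
```

Saved under a hypothetical name such as clips.py, it could be fed with `samtools view aln.bam | python clips.py` (aln.bam being a placeholder). Note that SAM stores SEQ in reference-forward orientation, so clips from reverse-strand alignments appear reverse-complemented relative to the original read; the same insert can therefore show up as two different strings.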
If anyone can help me understand what these reads might be it would really help me decide what to do with them!