I am using GATK HaplotypeCaller (v4.5.0.0) for variant calling on human samples, but all reads are being filtered out, and the VCF file only contains the header with no detected variants. My pipeline starts with STAR alignment, using the following command:
STAR --runThreadN 20 --genomeDir "$index" \
--readFilesIn "$fq1" "$fq2" \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--outFileNamePrefix "$OUTPUT_DIR/${base}_"
After alignment, I use Picard MarkDuplicates (v3.3.0) to remove duplicates and index the BAM file:
picard MarkDuplicates \
INPUT="$rg_added_bam" \
OUTPUT="$marked_bam" \
METRICS_FILE="$metrics_file" \
VALIDATION_STRINGENCY=LENIENT \
REMOVE_DUPLICATES=True \
CREATE_INDEX=True
I then run HaplotypeCaller with the following command:
gatk HaplotypeCaller -R ../reference/hg38.fa -I TEST.bam -O TEST.vcf
However, in the program log, I see that all reads are being filtered out due to MappingQualityReadFilter and MappingQualityAvailableReadFilter, as shown in this output:
17:52:14.409 INFO HaplotypeCaller - 2098339 read(s) filtered by: MappingQualityReadFilter
12878897 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
14977236 total reads filtered out of 14977236 reads processed
As a result, my VCF file contains only the header with no variant calls:
source=HaplotypeCaller
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TEST
I have already performed several troubleshooting steps. I checked the BAM file in IGV, and I can clearly see variants present at different loci. I also verified that my input directories and reference genome paths are correct. Additionally, I confirmed that the BAM file and reference genome are sorted properly. To cross-check my pipeline, I ran the exact same workflow on yeast samples, and it successfully detected variants. However, with human samples, no variants are being called.
What could be causing no variants to be detected when using human samples, even though the pipeline works with yeast data?