Question

HaplotypeCaller Not Detecting Variants

1

Entering edit mode

9 months ago

akshit ▴ 10

I am using GATK HaplotypeCaller (v4.5.0.0) for variant calling on human samples, but all reads are being filtered out, and the VCF file only contains the header with no detected variants. My pipeline starts with STAR alignment, using the following command:

STAR --runThreadN 20 --genomeDir "$index" \
     --readFilesIn "$fq1" "$fq2" \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat \
     --outFileNamePrefix "$OUTPUT_DIR/${base}_"

After alignment, I use Picard MarkDuplicates (v3.3.0) to remove duplicates and index the BAM file:

picard MarkDuplicates \
    INPUT="$rg_added_bam" \
    OUTPUT="$marked_bam" \
    METRICS_FILE="$metrics_file" \
    VALIDATION_STRINGENCY=LENIENT \
    REMOVE_DUPLICATES=True \
    CREATE_INDEX=True

I then run HaplotypeCaller with the following command:

gatk HaplotypeCaller -R ../reference/hg38.fa -I TEST.bam -O TEST.vcf

However, in the program log, I see that all reads are being filtered out due to MappingQualityReadFilter and MappingQualityAvailableReadFilter, as shown in this output:

17:52:14.409 INFO  HaplotypeCaller - 2098339 read(s) filtered by: MappingQualityReadFilter  
12878897 read(s) filtered by: MappingQualityAvailableReadFilter  
0 read(s) filtered by: MappedReadFilter  
0 read(s) filtered by: NotSecondaryAlignmentReadFilter  
0 read(s) filtered by: NotDuplicateReadFilter  
0 read(s) filtered by: PassesVendorQualityCheckReadFilter  
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter  
0 read(s) filtered by: GoodCigarReadFilter  
0 read(s) filtered by: WellformedReadFilter  
14977236 total reads filtered out of 14977236 reads processed

As a result, my VCF file contains only the header with no variant calls:

source=HaplotypeCaller  
CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TEST

I have already performed several troubleshooting steps. I checked the BAM file in IGV, and I can clearly see variants present at different loci. I also verified that my input directories and reference genome paths are correct. Additionally, I confirmed that the BAM file and reference genome are sorted properly. To cross-check my pipeline, I ran the exact same workflow on yeast samples, and it successfully detected variants. However, with human samples, no variants are being called.

What could be causing no variants to be detected when using human samples, even though the pipeline works with yeast data?

GATK Variant-calling WGS • 1.0k views

ADD COMMENT • link updated 8 months ago by Ram 45k • written 9 months ago by akshit ▴ 10

score 2 · Answer 1 · 2025-02-06

MAPQ=255 is "unknown mapq"

see https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

The default MAPQ=255 for the unique mappers maybe changed with --outSAMmapqUnique parameter (integer 0 to 255) to ensure compatibility with downstream tools such as GATK

use

--outSAMmapqUnique  60

to get something closer to BWA.

score 0 · Answer 2 · 2025-02-26

The best way of running variant calling on STAR aligned bam files is to process the bam file with GATK SplitNCigarReads tool. HaplotypeCaller will ignore reads that contain spliced CIGARs so before using it reads with N CIGARs will be split into primary and supplementary reads with proper flags and HaplotypeCaller will gladly use all of those primary and supplementary alignments of the same spliced read for variant calling.