Question: HaplotypeCaller with --dbsnp does not populate ID column
6.2 years ago
I want to obtain a VCF file containing genotype calls and their scores for every rsID, whether or not a variant was called. I was planning to use the following steps:

  1. HaplotypeCaller --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY --emitRefConfidence BP_RESOLUTION as shown below
  2. awk '{ if ( $3 != "." ) { print $0; } }' variants.vcf > variants.filtered.vcf
  3. GenotypeGVCFs --includeNonVariantSites
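
Step 2 could be written a little more defensively. The awk one-liner splits on any whitespace and does not treat header lines specially. The version below is only a sketch: it splits on tabs and always keeps lines starting with `#`. The name `keep_with_id` is a hypothetical helper, not part of GATK.

```shell
# Keep VCF header lines plus records whose ID column (field 3) is not ".".
# keep_with_id is a hypothetical helper name used only for illustration.
keep_with_id() {
    awk -F'\t' '/^#/ || $3 != "."' "$1"
}
```

It would be called as `keep_with_id variants.vcf > variants.filtered.vcf`.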

Using the most recent dbSNP download, I ran this:

GATK -T HaplotypeCaller --reference_sequence GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --input_file recalibrated.bam --dbsnp current_dbsnp/All.vcf.gz --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY --emitRefConfidence BP_RESOLUTION --out variants.vcf

However, the ID column only contains ".". How can I get HaplotypeCaller to populate the ID column with rsIDs? Also, is there a better way to get variant and non-variant genotype calls with HaplotypeCaller?


6.2 years ago
It looks like your reference genome is from Ensembl (GRCh38), which uses a 1-based coordinate system, while the dbSNP file you used is from NCBI, which uses a 0-based coordinate system, just like UCSC.

Could that be why no variants are matched to dbSNP and the ID column only shows "."?

That's concerning, then. I thought VCF was always 1-based.

However, I don't think that's the issue, since, with BP_RESOLUTION, literally every position is called (chr1:1, chr1:2, chr1:3, ...).
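
A quick way to check whether the two files can be matched at all is to list the contig names each one uses, since rsIDs should only be annotated where contig name and position agree between the calls and the dbSNP file. The sketch below assumes standard tab-delimited VCF text on stdin. `list_contigs` is a hypothetical helper name.

```shell
# Print each distinct contig name (column 1) once, in order of first appearance.
# Header lines (starting with "#") are skipped.
# list_contigs is a hypothetical helper name used only for illustration.
list_contigs() {
    awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }'
}
```

For example, `list_contigs < variants.vcf` and `gzip -cd current_dbsnp/All.vcf.gz | list_contigs` could be compared side by side. If one prints names like `chr1` and the other prints `1`, the mismatch is in naming, not in the coordinate system.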

