Fewer sites in vcf than in reference genome despite Emit_All_Sites argument
0
0
Entering edit mode
3 months ago
shpak.max ▴ 50

I have been running GATK's UnifiedGenotyper with the argument Emit_All_Sites, e.g.

Unified Genotyper -out_mode Emit_All_Sites -R my_ref.fasta -mbq 10 -stand_call_conf 31 -stand_emit_conf 31 -ploidy 2 -I genome1.bam -o genome_sites.vcf

It is my understanding that the number of sites per chromosome arm in in the vcf should precisely match the number of sites for the same chromosome arm in the reference genome. Instead, I find that the number of sites in the vcf is consistently smaller, often by hundreds of kb.

I'm trying to understand the possible reasons for this - it shouldn't be due to ambiguous 'N' sites in the reference genome, as these should still be returned as placeholder lines in the vcf. Similarly, more or less restrictive settings for stand_emit_conf should also give blank placeholders rather than skipping lines when evidence at a site is poor.

Is there anything that I'm missing?

GATK UnifiedGenotyper • 390 views
ADD COMMENT
0
Entering edit mode

I can't speak to UnifiedGenotyper, as it has been many years since GATK stopped supporting it, and I have not used it in a long time. But it in the documentation for HaplotypeCaller, this is what is said about the Emit_All_Sites flag:

EMIT_ALL_SITES produces calls at any callable site regardless of confidence

Not all sites are callable (e.g, sites where no reads mapped), so I would not expect the VCF to have as many entries as there are positions in the genome, even with this setting.

ADD REPLY
0
Entering edit mode

If that were the case, I would expect there not to be any uncalled sites in the vcf, yet there are a large number of sites with "." placeholders for uncalled sites, included at the chromosome ends where the reference is N. So I'm trying to figure out why some uncallable sites are returned as placeholder lines while others are excluded.

Is there some argument that forces GATK to return a blank line for all uncalled sites?

ADD REPLY
0
Entering edit mode

I believe UnifiedGenotyper is a locus walker that sets emitEmptySites() = False. As such you will not see entries for positions that have no reads. You might be able to get around this by passing -L or -V which covers all sites in the reference.

ADD REPLY
0
Entering edit mode

Could you please elaborate on what you mean by passing -L or -V, i.e. what would the appropriate argument be for -L with UnifiedGenotyper for the desired result of returning all sites?

ADD REPLY

Login before adding your answer.

Traffic: 1501 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6