Hi,
I've been using HaplotypeCaller and GenotypeGVCFs (version 4.2) in all-sites mode to generate VCF files including invariant sites. I've just discovered, however, that in some positions - as far as I can tell only at invariant sites - the ref allele is longer than 1 bp, but still directly followed by another record, for which the ref allele is the second base in the first record's ref allele, and so on.
A bit badly explained, but I'm attaching a screenshot of an example here. It seems to be consistent and only at invariant sites, so I suppose it would be safe to just cut down these long ref alleles to the first base, but I would like to understand why it happens, in case I'm missing something obvious here.
For the most I suppose this wouldn't have a downstream impact, but I stumbled upon this when I was converting from vcf to sequence alignments, which will of course be affected by this behaviour.
Anyone knows what could be causing this, and if there's any meaning of it?
Thanks!
Axel!
Thanks for answering! Indeed I'm working with a diploid organism, but I don't see how this could be the explanation here. The reference genome is a haploid representation so that in itself can't have a long deletion. And no alts are present, which I suppose would be the case if any of the samples had a deletion?
yes
hum... no.
GATK use the BAM an tries to make a consensus for a diploid organisme. So if you have 50% reads having a REF allele and 50% reads having an ALT allele, your genotype with be REF/ALT (!= haploid representation)
But now you're talking about the genotypes - not the reference genome - which are all 0/0 across the data set. This means that, when converting the above region to an alignment, we would introduce artificial duplications at all these positions with ref alleles longer than one nt.
OK, so I don't understand your question.
The issue here is the reference alleles that span across multiple genomic positions, and thus overlap each other. If you look at position 123651 above, for example. The reference allele on that position is GGCTCTGC. Clearly, this is not a single position, but 8. Therefore, I would not expect it to be directly followed by a new reference allele at position 123652, but it is. Furthermore, if you look at positions 123652-123658, the reference alleles are G,C,T,C,T,G,C. That is, the exact same bases as position 2-8 in the reference allele at 123651. All samples are 0/0 at these positions. This means that, if we were to write out the genotypes as represented in this vcf region, the GCTCTGC would be (artificically, most likely) dubplicated.
Hope this made it clearer, thanks for taking the time to answer! I made a new comment below with a link to the gatk forum, as it seems that this may be a bug.