duplicated (ish) records in GATK vcf file
2
0
Entering edit mode
3.0 years ago
axejen ▴ 10

Hi,

I've been using HaplotypeCaller and GenotypeGVCFs (version 4.2) in all-sites mode to generate VCF files including invariant sites. I've just discovered, however, that in some positions - as far as I can tell only at invariant sites - the ref allele is longer than 1 bp, but still directly followed by another record, for which the ref allele is the second base in the first record's ref allele, and so on.

A bit badly explained, but I'm attaching a screenshot of an example here. It seems to be consistent and only at invariant sites, so I suppose it would be safe to just cut down these long ref alleles to the first base, but I would like to understand why it happens, in case I'm missing something obvious here.

For the most I suppose this wouldn't have a downstream impact, but I stumbled upon this when I was converting from vcf to sequence alignments, which will of course be affected by this behaviour.

Anyone knows what could be causing this, and if there's any meaning of it?

Thanks!

Axel! Example of a genomic stretch where the same reference allele is represented across multiple genomic positions

vcf gatk • 2.1k views
ADD COMMENT
1
Entering edit mode
3.0 years ago

the ref allele is longer than 1 bp, but still directly followed by another record, for which the ref allele is the second base in the first record's ref allele, and so on.

it's not a problem if your organism is diploid and one strand has a big deletion, or it can be a genotyping arcefact. Just look at the interval with IGV ?

ADD COMMENT
0
Entering edit mode

Thanks for answering! Indeed I'm working with a diploid organism, but I don't see how this could be the explanation here. The reference genome is a haploid representation so that in itself can't have a long deletion. And no alts are present, which I suppose would be the case if any of the samples had a deletion?

ADD REPLY
0
Entering edit mode

The reference genome is a haploid representation

yes

so that in itself can't have a long deletion.

hum... no.

ADD REPLY
0
Entering edit mode

GATK use the BAM an tries to make a consensus for a diploid organisme. So if you have 50% reads having a REF allele and 50% reads having an ALT allele, your genotype with be REF/ALT (!= haploid representation)

ADD REPLY
0
Entering edit mode

But now you're talking about the genotypes - not the reference genome - which are all 0/0 across the data set. This means that, when converting the above region to an alignment, we would introduce artificial duplications at all these positions with ref alleles longer than one nt.

ADD REPLY
0
Entering edit mode

OK, so I don't understand your question.

ADD REPLY
0
Entering edit mode

The issue here is the reference alleles that span across multiple genomic positions, and thus overlap each other. If you look at position 123651 above, for example. The reference allele on that position is GGCTCTGC. Clearly, this is not a single position, but 8. Therefore, I would not expect it to be directly followed by a new reference allele at position 123652, but it is. Furthermore, if you look at positions 123652-123658, the reference alleles are G,C,T,C,T,G,C. That is, the exact same bases as position 2-8 in the reference allele at 123651. All samples are 0/0 at these positions. This means that, if we were to write out the genotypes as represented in this vcf region, the GCTCTGC would be (artificically, most likely) dubplicated.

Hope this made it clearer, thanks for taking the time to answer! I made a new comment below with a link to the gatk forum, as it seems that this may be a bug.

ADD REPLY
1
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 4104 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6