What is the correct way of setting the genotype after splitting multi-allelic sites in a VCF file?
1
0
Entering edit mode
6 weeks ago
terdon ▴ 410

Consider this minimal VCF file which contains a multi-allelic call made by a somatic variant caller on a human tumor sample:

##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  foo
chr10   43613843        .       GC      TC,CC,TT        .       clustered_events;multiallelic   DP=5180 GT:AD:AF:DP:F1R2:F2R1:SB        0/1/2/3:0,5154,11,15:0.994,0.00229,0.00306:5180:0,2507,4,8:0,2576,6,5:0,0,2585,2595


So, we have a site with three ALT alleles and a genotype of 0/1/2/3.

The VCF specification has this to say about the genotype field:

GT (String): Genotype, encoded as allele values separated by either of/or|. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.

As I understand the VCF specification, given the ALT field TC,CC,TT, the genotype of 0/1/2/3 is telling me that the reference sequence is absent but there are heterozygous calls for each of the three variants in ALT. This makes sense because this is a tumor sample, so presumably what we have here is three populations of cells, some with the GC=>TC change, some with GC=>CC and some with GC=>TT. Given the very high support for the GC=>TC change, I would assume this is actually a germline variant and there are small populations of tumor cells in my sample with either the GC=>CC or the GC=>TT variant. Alternatively, and quite likely, the latter two variants are sequencing artefacts.

Regardless, I need to split these into one variant per line. As I understand the specification, after splitting, I should be left with three variants and these specific genotypes (I am showing only the genotype here, for simplicity):

chr10   43613843   .   GC    TC   GT 1/0
chr10   43613843   .   GC    CC   GT 1/0
chr10   43613843   .   GC    TT  GT 1/0


Is this actually correct? Unfortunately, standard tools like bcftools produce something different:

$bcftools norm -m -any foo.vcf 2>/dev/null | grep chr10 | awk '{print$1,$2,$3,$4,$5,$9,$10}'
chr10 43613843 . GC TC GT:AD:AF:DP:F1R2:F2R1:SB 0/1/0/0:0,5154:0.994:5180:0,2507:0,2576:0,0,2585,2595
chr10 43613843 . GC CC GT:AD:AF:DP:F1R2:F2R1:SB 0/0/1/0:0,11:0.00229:5180:0,4:0,6:0,0,2585,2595
chr10 43613843 . GC TT GT:AD:AF:DP:F1R2:F2R1:SB 0/0/0/1:0,15:0.00306:5180:0,8:0,5:0,0,2585,2595


Here, we have sites with a single ALT variant but which have multiple genotypes and that doesn't make sense to me. How can we have a 0/0/0/1 genotype and only one ALT allele? Is bcftools assuming this is a triploid genome? Is this a bug in bcftools or am I misunderstanding the VCF specification? What is the correct way to display the genotype after splitting this kind of multi-allelic call?

=====

This question has also been asked on Bioinformatics Stack Exchange.

genotype bcftools vcf • 245 views
0
Entering edit mode
6 weeks ago

It seems to assume tetraploidy at this position, yes, as inferred via 0/1/2/3. Otherwise, we would see a GT of, e.g., 2/3, 2/2, etc; So, it splits this correctly, from what I can see.

The mere presence of 0/1/2/3 is going to be related to the upstream variant caller, whatever that was

Kevin