What is the correct way of setting the genotype after splitting multi-allelic sites in a VCF file?
1
0
Entering edit mode
2.9 years ago
terdon ▴ 430

Consider this minimal VCF file which contains a multi-allelic call made by a somatic variant caller on a human tumor sample:

##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  foo
chr10   43613843        .       GC      TC,CC,TT        .       clustered_events;multiallelic   DP=5180 GT:AD:AF:DP:F1R2:F2R1:SB        0/1/2/3:0,5154,11,15:0.994,0.00229,0.00306:5180:0,2507,4,8:0,2576,6,5:0,0,2585,2595

So, we have a site with three ALT alleles and a genotype of 0/1/2/3.

The VCF specification has this to say about the genotype field:

GT (String): Genotype, encoded as allele values separated by either of/or|. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.

As I understand the VCF specification, given the ALT field TC,CC,TT, the genotype of 0/1/2/3 is telling me that the reference sequence is absent but there are heterozygous calls for each of the three variants in ALT. This makes sense because this is a tumor sample, so presumably what we have here is three populations of cells, some with the GC=>TC change, some with GC=>CC and some with GC=>TT. Given the very high support for the GC=>TC change, I would assume this is actually a germline variant and there are small populations of tumor cells in my sample with either the GC=>CC or the GC=>TT variant. Alternatively, and quite likely, the latter two variants are sequencing artefacts.

Regardless, I need to split these into one variant per line. As I understand the specification, after splitting, I should be left with three variants and these specific genotypes (I am showing only the genotype here, for simplicity):

chr10   43613843   .   GC    TC   GT 1/0
chr10   43613843   .   GC    CC   GT 1/0
chr10   43613843   .   GC    TT  GT 1/0

Is this actually correct? Unfortunately, standard tools like bcftools produce something different:

$ bcftools norm -m -any foo.vcf 2>/dev/null | grep chr10 | awk '{print $1,$2,$3,$4,$5,$9,$10}'
chr10 43613843 . GC TC GT:AD:AF:DP:F1R2:F2R1:SB 0/1/0/0:0,5154:0.994:5180:0,2507:0,2576:0,0,2585,2595
chr10 43613843 . GC CC GT:AD:AF:DP:F1R2:F2R1:SB 0/0/1/0:0,11:0.00229:5180:0,4:0,6:0,0,2585,2595
chr10 43613843 . GC TT GT:AD:AF:DP:F1R2:F2R1:SB 0/0/0/1:0,15:0.00306:5180:0,8:0,5:0,0,2585,2595

Here, we have sites with a single ALT variant but which have multiple genotypes and that doesn't make sense to me. How can we have a 0/0/0/1 genotype and only one ALT allele? Is bcftools assuming this is a triploid genome? Is this a bug in bcftools or am I misunderstanding the VCF specification? What is the correct way to display the genotype after splitting this kind of multi-allelic call?

=====

This question has also been asked on Bioinformatics Stack Exchange.

genotype bcftools vcf • 1.2k views
ADD COMMENT
1
Entering edit mode
2.8 years ago

It seems to assume tetraploidy at this position, yes, as inferred via 0/1/2/3. Otherwise, we would see a GT of, e.g., 2/3, 2/2, etc; So, it splits this correctly, from what I can see.

The mere presence of 0/1/2/3 is going to be related to the upstream variant caller, whatever that was

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 2845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6