Figure out genotype from VCF line
1
1
Entering edit mode
3.9 years ago
imbiling ▴ 10

I have generated a gVCF (genomic VCF) with GATK using this command line:

gatk HaplotypeCaller -R GRCh38.p13.genome.fa.gz -I bam.bam -D dbsnp_146.hg38.vcf.gz -O genome.g.vcf.gz -ERC GVCF --output-mode EMIT_ALL_ACTIVE_SITES

I'm using GRCh38 patch 13, my BAM file is also based on it.

I want to analyze my genotype for rs4633: https://www.snpedia.com/index.php/Rs4633

As I understand from SNPedia, the reference allele is "C". As I have two variants from my mother and my father, I can have either (C;C), (C;T) or (T;T)

In my gVCF I have the corresponding line:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bar
chr22   19962712        rs4633  C       T,<NON_REF>     1117.03 .       DB;DP=29;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=95148,29

The reference allele in chromosome 22 at position 19962712 is "C". But what exactly does "T,<non_ref>" mean in the ALT column? How should I interpret this? How can I extract the two alleles (father/mother) from this line and map them to the ones from SNPedia: (C;C), (C;T) or (T;T)?

To ask the question in a different way: how can I obtain both my alleles from the ALT column?

To add to this confusion. Another example. I want to know my genotype for rs1801131. SNPedia says: https://www.snpedia.com/index.php/rs1801131 Possible alleles: (A;A), (A;C), (C;C)

Why the "letters" are different than rs4633? Is it because of A=T and G=C?

The line in the gVCF is:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bar
chr1    11794419        rs1801131       T       G,<NON_REF>     1193.60 .       BaseQRankSum=0.306;DB;DP=54;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.833;RAW_MQandDP=151742,54;ReadPosRankSum=1.170 GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:19,35,0:54:99:0|1:11794400_G_A:1201,0,583,1258,688,1945:11794400:12,7,20,15

So reference allele "T" and "alternative" is "G,<non_ref>". So what are my two alleles which I can map to the possible alleles from SNPedia: (A;A), (A;C), (C;C)?

My assumption is that REF column is the allele from GRCh38 and ALT is somehow my own, but there should be two - one from mom and one from dad. What I'm doing wrong?

SNP genome vcf • 1.6k views
ADD COMMENT
3
Entering edit mode
3.9 years ago
rbagnall ★ 1.8k

Two things to help you:

But what exactly does "T,<\non_ref>" mean in the ALT column?

You have only made a g.vcf file with HaplotypeCaller. This is a file that is 'poised' to be genotyped, but that has not yet been performed. You need to use the g.vcf file as input in the genotyping step of GATK using the GenotypeGVCFs tool. This will give you the genotypes that you are expecting to see.

Why the "letters" are different than rs4633? Is it because of A=T and G=C?

Yes. the MTHFR gene is transcribed on the reverse strand of DNA and SNPedia is showing the alleles corresponding to the mRNA. In the reference sequence these alleles will be the complimentary alleles. from SNPedia:

rs1801131 is a SNP in the MTHFR gene, representing an A>C mutation at mRNA position 1298, resulting in a glu429-to-ala (E429A) substitution (hence this SNP is also known as A1298C or E429A).

ADD COMMENT
0
Entering edit mode

Thank you very much for this answer! I ran the GenotypeGVCFs tool and the result for rs4633 for example is:

chr22   19962712        rs4633  C       T       1117.03 .       AC=2;AF=1.00;AN=2;DB;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=57.28;QD=32.90;SOR=0.760        GT:AD:DP:GQ:PL   1/1:0,29:29:87:1131,87,0

So the reference allele is C and "mine" is T. The problem is that I still need two "mine" alleles - one from my mother and one from my father. Isn't this right? Otherwise, I can't map this "T" that I got from the conversion to the two alleles tuple described in SNPedia: (C;C), (C;T) or (T;T). What am I missing? Thanks in advance!

ADD REPLY
0
Entering edit mode

The last column shows the genotype

1/1:0,29:29:87:1131,87,0

the 1/1 means the genotype is T/T

a 0/0 would be C/C

a 0/1 would be C/T

the 0,29 means there were 0 reads with a C and 29 with a T

Have a look at the VCF specifications to understand the other columns

ADD REPLY
0
Entering edit mode

Thank you so much! This is exactly what I was missing! rbagnall and RamRS, I'm really grateful. This was the last confusion I was having. Now everything is clear. I've accepted the answer. Thank you, thank you so much!

ADD REPLY
0
Entering edit mode

Hi, thanks but this answer was all rbagnall. I merely edited their (and your) formatting.

ADD REPLY

Login before adding your answer.

Traffic: 2503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6