Question: Figure out genotype from VCF line
1
gravatar for imbiling
4 months ago by
imbiling10
imbiling10 wrote:

I have generated a gVCF (genomic VCF) with GATK using this command line:

gatk HaplotypeCaller -R GRCh38.p13.genome.fa.gz -I bam.bam -D dbsnp_146.hg38.vcf.gz -O genome.g.vcf.gz -ERC GVCF --output-mode EMIT_ALL_ACTIVE_SITES

I'm using GRCh38 patch 13, my BAM file is also based on it.

I want to analyze my genotype for rs4633: https://www.snpedia.com/index.php/Rs4633

As I understand from SNPedia, the reference allele is "C". As I have two variants from my mother and my father, I can have either (C;C), (C;T) or (T;T)

In my gVCF I have the corresponding line:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bar
chr22   19962712        rs4633  C       T,<NON_REF>     1117.03 .       DB;DP=29;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=95148,29

The reference allele in chromosome 22 at position 19962712 is "C". But what exactly does "T,<non_ref>" mean in the ALT column? How should I interpret this? How can I extract the two alleles (father/mother) from this line and map them to the ones from SNPedia: (C;C), (C;T) or (T;T)?

To ask the question in a different way: how can I obtain both my alleles from the ALT column?

To add to this confusion. Another example. I want to know my genotype for rs1801131. SNPedia says: https://www.snpedia.com/index.php/rs1801131 Possible alleles: (A;A), (A;C), (C;C)

Why the "letters" are different than rs4633? Is it because of A=T and G=C?

The line in the gVCF is:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bar
chr1    11794419        rs1801131       T       G,<NON_REF>     1193.60 .       BaseQRankSum=0.306;DB;DP=54;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.833;RAW_MQandDP=151742,54;ReadPosRankSum=1.170 GT:AD:DP:GQ:PGT:PID:PL:PS:SB    0|1:19,35,0:54:99:0|1:11794400_G_A:1201,0,583,1258,688,1945:11794400:12,7,20,15

So reference allele "T" and "alternative" is "G,<non_ref>". So what are my two alleles which I can map to the possible alleles from SNPedia: (A;A), (A;C), (C;C)?

My assumption is that REF column is the allele from GRCh38 and ALT is somehow my own, but there should be two - one from mom and one from dad. What I'm doing wrong?

snp vcf genome • 280 views
ADD COMMENTlink modified 4 months ago by zx87549.6k • written 4 months ago by imbiling10
3
gravatar for rbagnall
4 months ago by
rbagnall1.7k
Australia
rbagnall1.7k wrote:

Two things to help you:

But what exactly does "T,<\non_ref>" mean in the ALT column?

You have only made a g.vcf file with HaplotypeCaller. This is a file that is 'poised' to be genotyped, but that has not yet been performed. You need to use the g.vcf file as input in the genotyping step of GATK using the GenotypeGVCFs tool. This will give you the genotypes that you are expecting to see.

Why the "letters" are different than rs4633? Is it because of A=T and G=C?

Yes. the MTHFR gene is transcribed on the reverse strand of DNA and SNPedia is showing the alleles corresponding to the mRNA. In the reference sequence these alleles will be the complimentary alleles. from SNPedia:

rs1801131 is a SNP in the MTHFR gene, representing an A>C mutation at mRNA position 1298, resulting in a glu429-to-ala (E429A) substitution (hence this SNP is also known as A1298C or E429A).

ADD COMMENTlink written 4 months ago by rbagnall1.7k

Thank you very much for this answer! I ran the GenotypeGVCFs tool and the result for rs4633 for example is:

chr22   19962712        rs4633  C       T       1117.03 .       AC=2;AF=1.00;AN=2;DB;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=57.28;QD=32.90;SOR=0.760        GT:AD:DP:GQ:PL   1/1:0,29:29:87:1131,87,0

So the reference allele is C and "mine" is T. The problem is that I still need two "mine" alleles - one from my mother and one from my father. Isn't this right? Otherwise, I can't map this "T" that I got from the conversion to the two alleles tuple described in SNPedia: (C;C), (C;T) or (T;T). What am I missing? Thanks in advance!

ADD REPLYlink modified 4 months ago by RamRS30k • written 4 months ago by imbiling10

The last column shows the genotype

1/1:0,29:29:87:1131,87,0

the 1/1 means the genotype is T/T

a 0/0 would be C/C

a 0/1 would be C/T

the 0,29 means there were 0 reads with a C and 29 with a T

Have a look at the VCF specifications to understand the other columns

ADD REPLYlink modified 4 months ago by RamRS30k • written 4 months ago by rbagnall1.7k

Thank you so much! This is exactly what I was missing! rbagnall and RamRS, I'm really grateful. This was the last confusion I was having. Now everything is clear. I've accepted the answer. Thank you, thank you so much!

ADD REPLYlink modified 4 months ago • written 4 months ago by imbiling10

Hi, thanks but this answer was all rbagnall. I merely edited their (and your) formatting.

ADD REPLYlink written 4 months ago by RamRS30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1254 users visited in the last hour