Question: GTEx genotype data
4.0 years ago, yzzhong1993 wrote:

Hi, I am working on the GTEx data and downloaded this file


I want to get the genotypes for specific samples and encode them as 0, 1, or 2. I found that the GT field of this VCF has many missing values.

My question is: how should I deal with these missing values when converting to a genotype matrix? Is there an easier way to extract the genotype data from GTEx?

1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.742;IMPINFO=0.435;CERTAINTY=0.847;TYPE=0;MISS=0.8067;HW=0.24 GT:GL:DS .|.:0.006,0.483,0.510:1.504 .|.:0.047,0.365,0.589:1.542 0|1:0.013,0.960,0.028:1.015 .|.:0.064,0.487,0.449:1.385 1|1:0.000,0.041,0.959:1.959 .|.:0.612,0.379,0.010:0.398 .|.:0.002,0.149,0.850:1.848 .|.:0.007,0.485,0.508:1.501 .|.:0.031,0.289,0.681:1.650 .|.:0.003,0.207,0.789:1.786 .|.:0.007,0.488,0.504:1.497 .|.:0.009,0.217,0.774:1.765 .|.:0.003,0.260,0.736:1.733 .|.:0.252,0.508,0.240:0.988 .|.:0.276,0.508,0.217:0.941 .|.:0.084,0.:|1:0.065,0.904,0.031:0.966 0|0:0.993,0.007,0.000:0.007 .|.:0.488,0.445,0.066:0.578 0|1:0.046,0.947,0.008:0.962 .|.:0.733,0.265,0.003:0.270 .|.:0.806,0.192,0.002:0.196 .|.:0.637,0.357,0.007:0.370 .|.:0.693,0.303,0.003:0.310 .|.:0.014,0.397,0.588:1.574 .|.:0.743,0.245,0.012:0.269 0|0:0.958,0.041,0.000:0.042 .|.:0.870,0.128,0.002:0.132 0|0:0.958,0.041,0.000:0.042 .|.:0.611,0.354,0.036:0.425 .|.:0.760,0.226,0.014:0.254 .|.:0.843,0.156,0.002:0.159


You could first convert the VCF file to PLINK files and then extract or remove the missing values. One downside of this is that you will lose the multi-allelic variants...

written 4.0 years ago by Floris Brenk

For someone who is not familiar with GTEx data, this question is totally unclear. Could you please specify:

  • where is the information about the genotype in your file?
  • what do you define as a genotype matrix?

I think your problem can easily be solved with grep/sed, by the way. Also, beware of posting sensitive (i.e., human) data on the internet if you obtained it through controlled access via a user login to the dbGaP portal. In that case, you can produce an input file that looks like the one you want to analyze but contains no real data (for your own security).
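As an alternative to grep/sed, picking out the columns for specific samples could be sketched in Python roughly as follows. This assumes an uncompressed, tab-separated VCF with a standard `#CHROM` header line. The function names and the sample IDs (`S1`, `S2`, `S3`) are made up for illustration and contain no real data.

```python
FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def sample_column_indices(header_line, wanted_samples):
    """Return the column index of each wanted sample in a #CHROM header line."""
    columns = header_line.rstrip("\n").split("\t")
    # Sample columns start after the nine fixed VCF columns.
    position = {name: i for i, name in enumerate(columns) if i >= FIXED_COLUMNS}
    absent = [s for s in wanted_samples if s not in position]
    if absent:
        raise KeyError(f"samples not found in header: {absent}")
    return [position[s] for s in wanted_samples]


def subset_record(record_line, indices):
    """Keep the fixed columns plus the selected sample columns of one record."""
    fields = record_line.rstrip("\n").split("\t")
    return fields[:FIXED_COLUMNS] + [fields[i] for i in indices]


# Usage with a fabricated header and record (hypothetical sample IDs).
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "S1", "S2", "S3"])
record = "\t".join(["1", "30923", "1_30923_G_T_b37", "G", "T", ".", "PASS",
                    ".", "GT", "0|1", ".|.", "1|1"])
idx = sample_column_indices(header, ["S3", "S1"])
print(subset_record(record, idx)[FIXED_COLUMNS:])  # GT strings for S3, then S1
```

The samples come back in the order they were requested, not in file order, which may or may not be what you want for a matrix.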

modified 3.9 years ago • written 3.9 years ago by dovah

You can filter out low-quality SNPs. That might remove most cases with missing genotypes. The remaining ones you can mark as missing data (NA or -1, say). Most statistical analyses you would run on the resulting data should take the missing data into account. (For example, if I were doing an eQTL analysis, I would use MatrixEQTL to carry out a linear-regression-based analysis, which handles missing data.)
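A minimal sketch of the 0/1/2 encoding with missing calls marked as NaN might look like this in Python. It assumes bi-allelic, diploid sites and a tab-separated record whose FORMAT column contains `GT`. The function names are hypothetical, and the example record is a shortened version of the line posted in the question.

```python
import math


def gt_to_count(gt):
    """Convert a GT string such as '0|1' to an alt-allele count (0, 1 or 2).

    Returns NaN if either allele is missing ('.'), so the call can be
    treated as missing data downstream.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a == "." for a in alleles):
        return math.nan
    # Assumption: bi-allelic site, so any allele other than '0' counts as ALT.
    return sum(1 for a in alleles if a != "0")


def record_to_row(record_line):
    """Return one genotype-matrix row (one value per sample) for a VCF record."""
    fields = record_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
    return [gt_to_count(sample.split(":")[gt_index]) for sample in fields[9:]]


# Usage on a shortened copy of the record from the question.
record = "\t".join([
    "1", "30923", "1_30923_G_T_b37", "G", "T", ".", "PASS", "MISS=0.8067",
    "GT:GL:DS",
    ".|.:0.006,0.483,0.510:1.504",
    "0|1:0.013,0.960,0.028:1.015",
    "1|1:0.000,0.041,0.959:1.959",
    "0|0:0.993,0.007,0.000:0.007",
])
print(record_to_row(record))
```

NaN is used here instead of -1 so that an accidental sum or mean does not silently treat a missing call as a genotype. The DS values in the posted record look like imputed dosages (for example, 0.483 + 2 × 0.510 ≈ 1.504), so they might be an alternative to hard calls where GT is missing, but that is an assumption about this particular file.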

written 3.8 years ago by vakul.mohanty