Question: How To Convert Snp Genotype Data Into 0,1,2 Matrix
8.6 years ago by
Fayue1015200 wrote:

Hi all, I have some questions about how to convert raw genotype data into a matrix filled with 0, 1, 2.

In the raw data, there are genotypes like CT CC CC CC CC CT CC TT CC CC CC NN (note: NN is a missing genotype). This is what I currently do:

  1. In the data it is obvious that C is the major allele and T is the minor allele, so CC is coded as 0, CT as 1, and TT as 2.
  2. However, the reference allele information from the array is A/G.
  3. For a missing NN genotype, I randomly assign 0, 1, or 2 to it.
  4. I convert the matrix for cases and controls separately.


My questions are:

  1. When we translate genotype data into a 0,1,2 matrix, we should look at the data itself to decide which allele is major and which is minor, right? Another thread mentions using the reference genotype to decide 0,1,2. Which one is correct?

  2. Is it ok to randomly assign 0,1,2 to the missing genotype instead of imputing them?

  3. Is it reasonable to convert the case and control data into a matrix separately?

After many tries, I get some SNP-SNP pairs with a very large likelihood ratio test value, such as 3000.

This is due to some very small cells in the contingency table. How should I deal with such pairs? Should I throw them away?

ADD COMMENTlink modified 19 months ago by Shicheng Guo8.3k • written 8.6 years ago by Fayue1015200

Note that when you convert the genotypes to 0,1,2 codes, you lose the phasing of the data, so CT will be merged with TC. What about using a '9' for the missing genotypes?

ADD REPLYlink written 8.1 years ago by Giovanni M Dall'Olio27k

Missing genotypes can be imputed in advance to avoid unexpected errors.

ADD REPLYlink written 19 months ago by Shicheng Guo8.3k
8.6 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

For point 1, it is your choice, really, to use dbSNP's major/ref allele or what you observe in your population. Is your population very similar to either the reference genome or a HapMap population? For example, one of our main populations is from Utah (USA) and is very much like the CEU HapMap population (I think some individuals were in both projects). Another of our populations is from the Boston (USA) area but of southern European origin, and so some variants have different MAFs (minor allele frequencies). In the Puerto Rican population we study, admixture has flipped some minor alleles to major. For the purpose of a matrix like the one you're putting together, you can use your observations for the 0,1,2 assignments.

For point 2, no, it is not good to randomly assign a 0, 1 or 2 to an NN genotype, even if you were to weight that assignment based on observed genotypes. The NN observations are unknown genotypes and must remain so, or you will see false associations when you do the phenotype-genotype association analysis. We delete that individual for that SNP for any association tests. You can use LD to assign a genotype, but do so carefully!

For point 3, you can keep these as separate matrices or you can have one matrix with an added column to indicate case or control.
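
As a rough illustration of points 1 to 3, here is a minimal Python sketch that picks the minor allele from the pooled observations, codes each genotype as the number of copies of that allele, and leaves NN as missing (None) rather than filling it in. The function names and the control genotypes are made up for the example; only the case genotypes come from the question.

```python
from collections import Counter

MISSING = "NN"  # missing-genotype code used in the question


def minor_allele(genotypes):
    """Return the less frequent allele among the called genotypes."""
    counts = Counter(a for g in genotypes if g != MISSING for a in g)
    if len(counts) != 2:
        raise ValueError("expected a biallelic SNP with both alleles observed")
    return min(counts, key=counts.get)


def encode(genotypes, counted_allele):
    """Count copies of counted_allele per genotype; missing stays None."""
    return [None if g == MISSING else g.count(counted_allele) for g in genotypes]


cases = "CT CC CC CC CC CT CC TT CC CC CC NN".split()  # from the question
controls = ["CC", "TC", "CC", "NN"]                     # hypothetical data

# Decide the counted allele once, on cases and controls pooled together,
# so both groups end up with the same 0/1/2 coding.
allele = minor_allele(cases + controls)
coded_cases = encode(cases, allele)
coded_controls = encode(controls, allele)

print(allele, coded_cases, coded_controls)
```

Choosing the counted allele once and passing it to both groups is what keeps a case matrix and a control matrix comparable, whether they are stored separately or stacked with a case/control column.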

ADD COMMENTlink written 8.6 years ago by Larry_Parnell16k

Actually I am using WTCCC data to analyze genotype-phenotype association. First, let's assume that the case and control samples are from the same origin, the UK. For point 3, what I mean is that the result differs depending on whether I analyze cases and controls separately or combined. For example, for the same SNP, the alleles may be C/T in the case sample but T/C in the control sample, while if cases and controls are combined, it is C/T. So the corresponding (0,1,2) matrix is different. Do you have experience in dealing with this specific issue? Thanks.

ADD REPLYlink written 8.6 years ago by Fayue1015200

Hi, also, as for using LD to assign genotypes, do you have any suggestion on which tool to use?

ADD REPLYlink written 8.6 years ago by Fayue1015200

There are quite a lot of tools. I'd say IMPUTE (Jonathan Marchini's site) is not too bad as it automatically creates chunks of data to be imputed. You will need to flip the alleles onto the positive strand as it uses panel reference haplotypes. MACH (Abecasis) is doing OK as well as BEAGLE (Browning).

ADD REPLYlink written 8.6 years ago by Genotepes950

We used MACH with good success.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k

My colleague just adds that the preferred choice these days is IMPUTE, v2.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k

Thanks Larry, you have made this specific issue very clear to me.

ADD REPLYlink written 8.6 years ago by Fayue1015200
8.6 years ago by
Nantes (France)
Genotepes950 wrote:


It really depends on what you need or want to do, and on the original format of the data.

If you are willing to analyse the association of SNPs with some trait under a "linear" model (in the sense that the risk increases - decreases - linearly with the number of alleles), then you can just do any transformation.

In PLINK you just need to use --recodeA (written --recode A in PLINK 1.9).

Obviously, the best thing is to define one allele (for instance the reference) as the 0 when homozygous. This way you can compare your results with results from other sources (if they use the same coding). Then use the --recode-allele option.

In general take a look at (maybe you know).

For filling in the missing data: this also depends on what you are trying to do. You might replace the NN by the mean dosage ((2*1 + 8*0 + 1*2)/11 = 4/11 for your example). You can assign it randomly given your allele frequencies (so the probability distribution you draw from is built from your allele frequencies). It is also possible to do real genetic imputation based on the genotypes at the neighbouring SNPs. If you need a plain genotype (0, 1, 2), then you can assign randomly based on genotype frequencies. You might want to do it several times to see how sensitive your procedure is to this imputation.
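
The two quick fill-in options above (mean dosage, and a random draw from observed genotype frequencies) could look roughly like this in Python. This is only a sketch: the function names are invented for illustration, and it is not LD-based imputation.

```python
import random


def mean_dosage_fill(coded):
    """Replace missing values (None) by the mean of the called dosages."""
    called = [x for x in coded if x is not None]
    mean = sum(called) / len(called)
    return [mean if x is None else x for x in coded]


def random_fill(coded, rng):
    """Replace missing values by a draw from the observed genotype frequencies."""
    called = [x for x in coded if x is not None]
    # Picking uniformly among called genotypes samples 0/1/2 in proportion
    # to how often each genotype was observed.
    return [rng.choice(called) if x is None else x for x in coded]


# 0/1/2 coding of the example genotypes, with NN kept as None
coded = [1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, None]

filled_mean = mean_dosage_fill(coded)  # the last entry becomes 4/11

# Repeat the random fill with different seeds to check sensitivity.
random_fills = [random_fill(coded, random.Random(seed)) for seed in range(5)]
for filled in random_fills:
    print(filled[-1])
```

Running the downstream analysis on each of the random fills and comparing the results is one simple way to do the sensitivity check suggested above.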

So back to the initial sentence : this depends on what you would like to do.

Finally, in your example, you may need to flip the alleles.
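
The flip here is a strand flip: C/T read on one strand is G/A on the other, which would match the A/G annotation from the array. A small Python sketch of that complement step, with a made-up helper name:

```python
# Complement of each base; N (missing) is left unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def flip_strand(genotype):
    """Return the genotype as read on the opposite strand."""
    return "".join(COMPLEMENT[allele] for allele in genotype)


observed = "CT CC CC CC CC CT CC TT CC CC CC NN".split()
flipped = [flip_strand(g) for g in observed]
print(" ".join(flipped))
```

This only resolves the mismatch for SNPs like C/T versus A/G. For A/T and C/G SNPs the complement uses the same two letters, so the strand cannot be inferred from the allele names alone.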

ADD COMMENTlink written 8.6 years ago by Genotepes950

Good points, but for missing data, we still prefer to use LD over random, even if based on observed allele frequencies. LD is the basis of imputation, which is used widely in GWAS.

ADD REPLYlink written 8.6 years ago by Larry_Parnell16k

Right, Larry. It is just that sometimes (for eigenvectors, for instance) we really prefer having all the data non-missing, using some rapid "imputation" procedure. I think Eigenstrat does this kind of data filling. But definitely, genetic imputation based on linkage disequilibrium is the best choice (as long as you have enough LD - otherwise)

ADD REPLYlink written 8.6 years ago by Genotepes950

Actually, I just use a logistic regression model, but I see some very large test values, such as 1000! I am not sure whether I need to throw away such tests or not.

ADD REPLYlink written 8.6 years ago by Fayue1015200

Again, what are you doing exactly? Regression on which outcome? I am asking because you are talking about SNP-SNP values, which looks like an interaction test rather than a simple association test.

ADD REPLYlink written 8.6 years ago by Genotepes950

Yes, I am doing epistasis analysis.

ADD REPLYlink written 8.6 years ago by Fayue1015200
19 months ago by
Shicheng Guo8.3k wrote:

How about the Generate_SSD_SetID function in the SKAT R package?

library(SKAT)
SKAT.input<-Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
SSD.INFO<-Open_SSD(File.SSD, File.Info)
## obj must be a null model fitted beforehand with SKAT_Null_Model()
Result.SKAT<-SKAT.SSD.All(SSD.INFO, obj, kernel = "linear") ## note the use of the linear kernel

You may also want to look at Rvtests. It includes a variety of association tests (e.g. single variant score test, burden test, variable threshold test, SKAT test, fast linear mixed model score test). What's more, it takes VCF and PLINK files as input.

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --geneFile refFlat_hg19.txt.gz --burden cmc --vt price --kernel skat,kbac
ADD COMMENTlink modified 19 months ago • written 19 months ago by Shicheng Guo8.3k