Hi all, I'd like to ask how to convert raw genotype data into a matrix coded as 0, 1, 2.
In the raw data, the genotypes at a SNP look like: CT CC CC CC CC CT CC TT CC CC CC NN (note: NN is a missing genotype).
- In the data it is clear that C is the major allele and T is the minor allele, so CC is coded as 0, CT as 1, and TT as 2.
- However, the reference allele information from the array is A/G.
- For the missing NN genotype, I randomly assign 0, 1, or 2.
- I convert the matrix for cases and controls separately.
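A minimal sketch of the coding described above, assuming genotypes are two-letter strings and NN marks a missing call. The function names `major_minor` and `encode` are made up for illustration, and missing calls are left as `None` here instead of being filled in at random:

```python
from collections import Counter

MISSING = "NN"  # missing-genotype label used in the example above

def major_minor(genotypes):
    """Return (major, minor) allele, counted over the called genotypes only."""
    counts = Counter()
    for g in genotypes:
        if g != MISSING:
            counts.update(g)  # a genotype like "CT" adds one C and one T
    ranked = [allele for allele, _ in counts.most_common()]
    major = ranked[0] if ranked else None
    minor = ranked[1] if len(ranked) > 1 else None  # None if monomorphic
    return major, minor

def encode(genotypes):
    """Code each genotype as its number of minor alleles; missing stays None."""
    _, minor = major_minor(genotypes)
    codes = []
    for g in genotypes:
        if g == MISSING:
            codes.append(None)
        elif minor is None:
            codes.append(0)
        else:
            codes.append(g.count(minor))
    return codes

genotypes = "CT CC CC CC CC CT CC TT CC CC CC NN".split()
print(major_minor(genotypes))  # ('C', 'T')
print(encode(genotypes))       # [1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, None]
```

If the minor allele is determined separately within cases and within controls, the same SNP could end up coded in opposite directions in the two groups, so counting alleles over all samples together keeps the coding consistent.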
Questions:
- When we translate genotype data into a 0/1/2 matrix, we should look at the data at hand and decide from it which allele is major and which is minor, right? However, the thread http://biostar.stackexchange.com/questions/7134/snp-genotype-data mentions using the reference genotype to decide 0, 1, 2. Which one is correct?
- Is it OK to randomly assign 0, 1, 2 to the missing genotypes instead of imputing them?
- Is it reasonable to convert the case and control data into a matrix separately?
- After many tries, I get some SNP-SNP pairs with a very large likelihood ratio test value, such as 3000, which is due to some very small cell counts in the contingency table. How should I deal with such pairs? Throw them away?
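To make the small-cell problem concrete, here is a rough sketch that cross-tabulates two coded SNPs and reports the smallest expected count alongside the statistic. It assumes the statistic is a G-test of independence on the 3x3 genotype table, which may not be exactly the test used in the question; `contingency` and `g_statistic` are hypothetical helper names, and missing calls are assumed to be `None`:

```python
import math

def contingency(x, y, levels=(0, 1, 2)):
    """Cross-tabulate two coded SNP vectors, skipping pairs with a missing call."""
    table = [[0] * len(levels) for _ in levels]
    for a, b in zip(x, y):
        if a is None or b is None:
            continue
        table[levels.index(a)][levels.index(b)] += 1
    return table

def g_statistic(table):
    """Return (G, smallest non-zero expected count) for a test of independence."""
    n = sum(map(sum, table))
    if n == 0:
        return 0.0, 0.0
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    g = 0.0
    min_expected = float("inf")
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / n
            if exp > 0:
                min_expected = min(min_expected, exp)
            if obs > 0:  # cells with obs == 0 contribute nothing to G
                g += 2 * obs * math.log(obs / exp)
    return g, min_expected

snp_a = [0, 0, 1, 1, None]
snp_b = [0, 0, 1, 1, 2]
g, min_expected = g_statistic(contingency(snp_a, snp_b))
print(g, min_expected)
```

A common rule of thumb is that the chi-square approximation becomes unreliable when expected counts fall below about 5, so pairs with a very small `min_expected` could be flagged for filtering or for an exact test instead of being ranked by the raw statistic.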
Note that when you convert the genotypes to 0, 1, 2 codes, you lose the phasing of the data, so CT will be merged with TC. What about using a '9' for the missing genotypes?
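A short sketch of that suggestion, assuming the minor allele is already known (T here) and NN marks a missing call; `code_genotype` and `MISSING_CODE` are illustrative names only:

```python
MISSING_CODE = 9  # sentinel for a missing genotype, not a real dosage

def code_genotype(genotype, minor, missing="NN"):
    """Count minor alleles regardless of allele order; missing becomes 9."""
    if genotype == missing:
        return MISSING_CODE
    return genotype.count(minor)

codes = [code_genotype(g, minor="T") for g in ["CT", "TC", "CC", "TT", "NN"]]
print(codes)  # [1, 1, 0, 2, 9]

# Mask the sentinel before doing any arithmetic on the codes.
called = [c for c in codes if c != MISSING_CODE]
print(called)  # [1, 1, 0, 2]
```

CT and TC both come out as 1, which is the loss of phase mentioned above. The 9 has to be filtered out before any calculation, otherwise it would be treated as a genotype value.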
Missing genotypes can be imputed in advance to avoid unexpected errors.