How To Convert SNP Genotype Data Into a 0,1,2 Matrix
10.4 years ago
Fayue1015 ▴ 200

Hi all. My question is how to convert raw genotype data into a matrix filled with 0, 1, 2.

In the raw data, the genotypes look like this: CT CC CC CC CC CT CC TT CC CC CC NN (note: NN is a missing genotype).

1. In the data it is obvious that C is the major allele and T is the minor allele, so CC is coded as 0, CT as 1, and TT as 2.
2. However, the reference allele information from the array is A/G.
3. For the missing NN genotype, I randomly assign 0, 1 or 2.
4. I convert the matrix for cases and controls separately.

Questions:

1. When we translate genotype data into a 0,1,2 matrix, we should look at the data themselves to decide which allele is major and which is minor, right? However, the thread http://biostar.stackexchange.com/questions/7134/snp-genotype-data mentions using the reference genotype to decide 0,1,2. Which one is correct?

2. Is it ok to randomly assign 0,1,2 to the missing genotype instead of imputing them?

3. Is it reasonable to convert the case and control data into a matrix separately?

After many tries, I get some SNP-SNP pairs with very large likelihood ratio test values, around 3000.

This is due to some very small cells in the contingency table. How should I deal with such pairs? Throw them away?
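
One way to spot such pairs before deciding what to do with them is to tabulate the joint genotypes by case/control status and list the sparse cells. The sketch below assumes genotypes are already coded 0/1/2 with `None` for missing; the function names and the threshold of 5 are illustrative assumptions (a common rule of thumb), not something taken from this thread.

```python
from collections import Counter

MIN_COUNT = 5  # assumption: rule-of-thumb threshold, not from the thread


def pair_table(snp1, snp2, status):
    """Count individuals for each (genotype1, genotype2, status) combination."""
    table = Counter()
    for g1, g2, s in zip(snp1, snp2, status):
        # Skip individuals with a missing genotype at either SNP.
        if g1 is not None and g2 is not None:
            table[(g1, g2, s)] += 1
    return table


def sparse_cells(table, statuses=(0, 1), min_count=MIN_COUNT):
    """List cells of the 3 x 3 x 2 table holding fewer than min_count individuals."""
    return [(g1, g2, s)
            for g1 in (0, 1, 2) for g2 in (0, 1, 2) for s in statuses
            if table[(g1, g2, s)] < min_count]


# Hypothetical toy data: 1 = case, 0 = control.
snp1 = [0, 0, 1, 1, 2, 0, None]
snp2 = [0, 1, 0, 1, 2, 0, 1]
status = [1, 0, 1, 0, 1, 0, 1]

table = pair_table(snp1, snp2, status)
empty = sparse_cells(table, min_count=1)  # cells with no individuals at all
print(len(empty), "of 18 cells are empty")
```

Pairs with many sparse cells could then be reviewed separately rather than trusted at face value.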

Note that when you convert the genotypes to 0,1,2 codes, you lose the phasing of the data, so CT will be merged with TC. What about using a '9' for the missing genotypes?
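
A minimal sketch of that coding, assuming two-letter genotype strings and T as the counted allele (the function name and the choice of `9` as the missing code are illustrative):

```python
MISSING_CODE = 9  # assumption: 9 marks a missing genotype, as suggested above


def encode_snp(genotypes, counted_allele, missing_code=MISSING_CODE):
    """Count copies of counted_allele in each two-letter genotype.

    Allele order is ignored, so "CT" and "TC" both become 1.
    Any genotype containing "N" gets missing_code.
    """
    codes = []
    for g in genotypes:
        if "N" in g:
            codes.append(missing_code)
        else:
            codes.append(g.count(counted_allele))
    return codes


raw = "CT CC CC CC CC CT CC TT CC CC CC NN".split()
codes = encode_snp(raw, counted_allele="T")
print(codes)  # [1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 9]
```

The missing code must be filtered out before any arithmetic on the matrix, otherwise 9 would be treated as a dosage.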

Missing genotypes can be imputed in advance to avoid unexpected errors.

10.4 years ago

For point 1, it is your choice, really, whether to use dbSNP's major/ref allele or what you observe in your population. Is your population very similar to either the reference genome or a HapMap population? For example, one of our main populations is from Utah (USA) and is very much like the CEU HapMap population (I think some individuals were in both projects). Another of our populations is from the Boston (USA) area but of southern European origin, so some variants have different MAFs (minor allele frequencies). In the Puerto Rican population we study, admixture has flipped some minor alleles to major. For the purpose of a matrix like the one you're putting together, you can use your observations for the 0,1,2 assignments.
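
Choosing the minor allele from your own observations can be sketched as follows. This assumes a biallelic SNP and two-letter genotype strings with N for missing; the function names are illustrative.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count each observed allele, skipping missing calls (any genotype with N)."""
    counts = Counter()
    for g in genotypes:
        if "N" not in g:
            counts.update(g)  # adds one count per allele letter
    return counts


def minor_allele(genotypes):
    """Return (minor allele, its frequency) for a biallelic SNP."""
    counts = allele_counts(genotypes)
    allele, n = min(counts.items(), key=lambda kv: kv[1])
    return allele, n / sum(counts.values())


raw = "CT CC CC CC CC CT CC TT CC CC CC NN".split()
allele, maf = minor_allele(raw)
print(allele, round(maf, 3))  # T 0.182
```

For a SNP with a frequency near 0.5 the minor allele can differ between samples, which is one reason to fix the counted allele explicitly.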

For point 2, no, it is not good to randomly assign a 0, 1 or 2 to an NN genotype, even if you were to weight that assignment based on observed genotypes. The NN observations are unknown genotypes and must remain so, or you will see false associations when you do the phenotype-genotype association analysis. We delete that individual for that SNP for any association tests. You can use LD to assign a genotype, but do so carefully!
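
Dropping an individual for one SNP only can be sketched like this, assuming missing genotypes are stored as `None` (the function name and toy data are hypothetical):

```python
def drop_missing(dosages, phenotypes):
    """Keep only the individuals whose genotype at this SNP is known (not None)."""
    pairs = [(d, p) for d, p in zip(dosages, phenotypes) if d is not None]
    kept_dosages = [d for d, _ in pairs]
    kept_phenotypes = [p for _, p in pairs]
    return kept_dosages, kept_phenotypes


# Hypothetical data: one SNP, six individuals, 1 = case, 0 = control.
dosages = [0, 1, None, 2, 0, None]
phenotypes = [1, 0, 1, 1, 0, 0]

kept_d, kept_p = drop_missing(dosages, phenotypes)
print(kept_d, kept_p)  # [0, 1, 2, 0] [1, 0, 1, 0]
```

The filter is applied per SNP, so the same individual still contributes to every SNP where a genotype was called.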

For point 3, you can keep these as separate matrices, or you can have one matrix with an added column to indicate case or control.

Actually, I am using WTCCC data to analyze genotype-phenotype association. First, let's assume that the case and control samples are of the same origin, the UK. For point 3, what I mean is that the result differs depending on whether I analyze cases and controls separately or combined. For example, for the same SNP, the alleles may come out as C/T in the case sample but T/C in the control sample, and as C/T when cases and controls are combined. So the corresponding (0,1,2) matrices are different. Do you have experience in dealing with this specific issue? Thanks.
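
The effect described here can be reproduced on a toy SNP. The sketch below uses hypothetical genotypes and illustrative function names; it picks the counted allele once from the pooled sample and applies it to both groups, which is one possible way to keep the two matrices comparable, not a method prescribed in this thread.

```python
from collections import Counter


def minor_allele(genotypes):
    """Return the less frequent allele, ignoring genotypes that contain N."""
    counts = Counter()
    for g in genotypes:
        if "N" not in g:
            counts.update(g)
    return min(counts, key=counts.get)


def encode(genotypes, counted_allele):
    """Count copies of counted_allele; missing genotypes become None."""
    return [None if "N" in g else g.count(counted_allele) for g in genotypes]


# Hypothetical genotypes for one SNP with allele frequencies near 0.5.
cases = ["CC", "CT", "CT", "TT", "CC"]           # C: 6, T: 4
controls = ["TT", "CT", "TT", "CT", "CC", "TT"]  # C: 4, T: 8

# Coded separately, the minor allele differs, so "CC" means 0 in one
# matrix and 2 in the other.
print(minor_allele(cases), minor_allele(controls))  # T C

# Coded with one allele chosen from the pooled sample, both agree.
counted = minor_allele(cases + controls)
case_codes = encode(cases, counted)
control_codes = encode(controls, counted)
print(counted, case_codes, control_codes)
```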

Hi, also, as for assigning genotypes by LD, do you have any suggestion for which tool to use?

There are quite a lot of tools. I'd say IMPUTE (Jonathan Marchini's site) is not too bad, as it automatically creates chunks of data to be imputed. You will need to flip the alleles onto the positive strand, as it uses reference panel haplotypes. MACH (Abecasis) does OK as well, as does BEAGLE (Browning).

We used MACH with good success.

My colleague just adds that the preferred choice these days is IMPUTE, v2.

10.4 years ago
Genotepes ▴ 950

Hi

It really depends on what you need or want to do, and on the original format of the data.

If you want to analyse the association of SNPs with some trait under a "linear" model (in the sense that the risk increases or decreases linearly with the number of alleles), then you can just do any transformation.

In plink you just need to do a --recodeA.

Obviously, the best thing is to define one allele (for instance the reference) as the 0 when homozygous. This way you can compare your results with results from other sources (if you have the same coding). Then use the --recode-allele option.

In general take a look at http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode (maybe you know).

For filling the missing data: this also depends on what you are trying to do. You might replace the NN by the mean of the dosage ((2*1 + 8*0 + 1*2)/11 for your example). You can assign it randomly given your allele frequencies (so the probability distribution you draw from is built from your allele frequencies). It is also possible to do real genetic imputation based on the genotypes at the neighbouring SNPs. If you need a plain genotype (0, 1, 2), then you can assign it randomly based on genotype frequencies. You might want to do it several times to see how sensitive your procedure is to this imputation.
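
The two quick fills described in this answer can be sketched as follows, assuming dosages coded 0/1/2 with `None` for missing. The function names are illustrative, and this is not LD-based imputation.

```python
import random
from collections import Counter


def mean_impute(dosages):
    """Replace None with the mean of the observed dosages."""
    observed = [d for d in dosages if d is not None]
    mean = sum(observed) / len(observed)
    return [mean if d is None else d for d in dosages]


def random_impute(dosages, rng):
    """Replace None with a genotype drawn from the observed genotype frequencies."""
    observed = [d for d in dosages if d is not None]
    freq = Counter(observed)
    values = sorted(freq)
    weights = [freq[v] for v in values]
    return [rng.choices(values, weights=weights)[0] if d is None else d
            for d in dosages]


# The example from the question, with T as the counted allele.
dosages = [1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, None]

mean_fill = mean_impute(dosages)[-1]  # (2*1 + 8*0 + 1*2) / 11
print(round(mean_fill, 3))

# Repeat the random fill with different seeds to gauge sensitivity.
fills = [random_impute(dosages, random.Random(seed))[-1] for seed in range(5)]
print(fills)
```

Rerunning the downstream analysis on each random fill shows how much the result depends on the filled-in values.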

So back to the initial sentence : this depends on what you would like to do.

Finally, in your example, you may need to flip the alleles.
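
A minimal sketch of such a flip, assuming the array's A/G annotation refers to the opposite strand from the observed C/T calls (that is an assumption to check against the array annotation; the function name is illustrative):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def flip_strand(genotype):
    """Return the genotype as read on the opposite strand (A<->T, C<->G)."""
    return "".join(COMPLEMENT[a] for a in genotype)


raw = "CT CC CC CC CC CT CC TT CC CC CC NN".split()
flipped = [flip_strand(g) for g in raw]
print(flipped)
# ['GA', 'GG', 'GG', 'GG', 'GG', 'GA', 'GG', 'AA', 'GG', 'GG', 'GG', 'NN']
```

After the flip the observed alleles are G and A, which matches the A/G annotation. A/T and C/G SNPs cannot be resolved this way, since they look the same on both strands.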

Good points, but for missing data, we still prefer to use LD over random, even if based on observed allele frequencies. LD is the basis of imputation, which is used widely in GWAS.

Right, Larry. It is just that sometimes (for eigenvectors, for instance) we really prefer to have all the data non-missing, using some rapid "imputation" procedure. I think Eigenstrat does this kind of data filling. But genetic imputation based on linkage disequilibrium is definitely the best choice (as long as you have enough LD - otherwise)

Actually, I just use a logistic regression model, but I see some very large test values, such as 1000! I am not sure whether I need to throw away such tests.

Again, what exactly are you doing: regression on which outcome? I am asking because you are talking about SNP-SNP values, which looks like an interaction test rather than a simple association test.

Yes, I am doing epistasis analysis.

3.5 years ago
Shicheng Guo ★ 9.0k

How about the Generate_SSD_SetID function in the SKAT R package?

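library("SKAT")
# File.Bed, File.Bim, File.Fam and File.SetID are paths to existing input files; File.SSD and File.Info are output paths.
SKAT.input <- Generate_SSD_SetID(File.Bed, File.Bim, File.Fam, File.SetID, File.SSD, File.Info)
SSD.INFO <- Open_SSD(File.SSD, File.Info)
SSD.INFO$nSample
SSD.INFO$nSets
# obj is the null model, e.g. obj <- SKAT_Null_Model(y ~ 1, out_type = "D") for a binary phenotype y
Result.SKAT <- SKAT.SSD.All(SSD.INFO, obj, kernel = "linear") ## note the use of the linear kernel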


You may also want to look at Rvtests. It includes a variety of association tests (e.g. single variant score test, burden test, variable threshold test, SKAT test, fast linear mixed model score test). What's more, it takes VCF and plink files as input.

rvtest --inVcf input.vcf --pheno phenotype.ped --out output --geneFile refFlat_hg19.txt.gz --burden cmc --vt price --kernel skat,kbac