Hi,
I'd like to use some SNP data from the TCGA dataset. How can I merge the SNP information with other data if I want to implement my own regression script (e.g. in R or Python)? I cannot use Plink or the GenABEL library in R because I need to merge different types of data.
Namely, assume I have a set of SNPs (which are the bases A, T, G and C) and other numeric variables for the same samples, and I want to put them in a single feature vector. How can I represent the SNPs?
Would it make sense to convert the letters into numbers (e.g. A=1, T=2, ...)?
Or are there better ways to combine SNPs and numeric data in one feature vector?
Thank you. My problem is how to define the "reference allele". I cannot use the most frequent allele in the dataset, first because the selection of the population is biased (the TCGA dataset for a tumor), and second because I think this is not a good idea in general.
Is there a way to find out the reference allele for a specific SNP (assuming it is not given in my data)?
If I look up into the NCBI page for a specific SNP, e.g. http://www.ncbi.nlm.nih.gov/snp/?term=Rs2308327
How can I know which one is the reference?
Using this example:
GCCCATGAAGGCCACCGGTTGGGGA[A/G]GCCAGGCTTGGGAGGGAGCTCAGGT
Is "A" the reference?
Then I guess I can code it as you said: 0 if my data has two copies of the reference allele (AA), 1 if it has one copy (AG or GA), and 2 if both alleles differ from the reference (GG), right?
Just download the reference genome and check which base it has at that SNP's position. Yes, a coding scheme like that (0, 1 or 2 copies of the non-reference allele) would be normal.
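
As a rough illustration of the 0/1/2 coding discussed above, here is a minimal Python sketch using only the standard library. The function names, sample IDs and numeric values are all made up for the example, and "A" is only assumed to be the reference allele for rs2308327; that should be checked against the reference genome before use.

```python
def encode_genotype(genotype, ref_allele):
    """Count the non-reference alleles in a two-letter genotype such as 'AG'.

    Returns 0, 1 or 2, or None if the call is missing or malformed.
    """
    genotype = genotype.upper()
    ref_allele = ref_allele.upper()
    if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
        return None
    return sum(1 for base in genotype if base != ref_allele)


def build_feature_vector(genotypes, ref_alleles, numeric_features):
    """Concatenate encoded SNPs (in sorted SNP-id order) with numeric variables."""
    snp_part = [
        encode_genotype(genotypes[snp_id], ref_alleles[snp_id])
        for snp_id in sorted(ref_alleles)
    ]
    return snp_part + list(numeric_features)


# ASSUMPTION: "A" as the reference allele is for illustration only.
ref_alleles = {"rs2308327": "A"}

# Hypothetical samples: (genotypes per SNP, other numeric variables).
samples = {
    "sample1": ({"rs2308327": "AA"}, [54.0, 1.2]),
    "sample2": ({"rs2308327": "GA"}, [61.0, 0.8]),
    "sample3": ({"rs2308327": "GG"}, [47.0, 2.5]),
}

# One feature vector per sample, ready to stack into a design matrix.
features = {
    sample_id: build_feature_vector(genotypes, ref_alleles, numeric)
    for sample_id, (genotypes, numeric) in samples.items()
}

for sample_id, vector in features.items():
    print(sample_id, vector)
```

Each SNP becomes a single numeric column, so it can sit next to the other numeric variables in an ordinary regression. Missing calls come back as None here and would need to be dropped or imputed before fitting.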