Question

SNP representation (going beyond Plink or GenABEL)

9.4 years ago

alex.crimi • 0

Hi,

I'd like to use some SNP data from the TCGA dataset. How can I merge the SNP information with other data if I want to implement my own regression script (e.g. in R or Python)? I cannot use Plink or the GenABEL library in R because I need to merge different types of data.

Namely, assume I have a set of SNPs (given as the bases A, T, G and C) and other numeric variables for the same samples, and I want to put them into a single feature vector. How can I represent the SNPs?

Would it make sense to convert the letters into numbers (e.g. A=1, T=2, ...)?

Or are there better ways to combine SNPs and numeric data in a feature vector?

representation SNP • 2.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by alex.crimi • 0

Thank you. My problem is how to define the "reference allele". I cannot use the most frequent allele in the dataset, first because the selection of the population is biased (the TCGA dataset for a tumor), and second because I don't think this is a good idea in general.

Is there a way to find out the reference allele for a specific SNP (assuming it is not given in my data)?

If I look up the NCBI page for a specific SNP, e.g. http://www.ncbi.nlm.nih.gov/snp/?term=Rs2308327

how can I tell which allele is the reference?

Using this example: GCCCATGAAGGCCACCGGTTGGGGA[A/G]GCCAGGCTTGGGAGGGAGCTCAGGT

Is "A" the reference?

Then I guess I can code it as you said: 0 if my data has two copies of the reference allele (AA), 1 if it has one copy (AG or GA), and 2 if it has no copy of the reference allele (GG), right?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by alex.crimi • 0
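
The 0/1/2 coding described in this comment can be sketched in a few lines of Python. This is only an illustration: the function name `alt_allele_count`, the covariates `age` and `expression`, and the choice of "A" as the reference allele are assumptions made for the example, not facts about rs2308327 or the TCGA data.

```python
def alt_allele_count(genotype: str, ref: str) -> int:
    """Count non-reference alleles in a diploid genotype such as 'AG'."""
    alleles = genotype.upper().replace("/", "").replace("|", "")
    if len(alleles) != 2 or any(a not in "ACGT" for a in alleles):
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return sum(a != ref.upper() for a in alleles)


# Assumption for illustration only: "A" is taken to be the reference allele.
ref_allele = "A"
codes = [alt_allele_count(g, ref_allele) for g in ["AA", "AG", "GA", "GG"]]
print(codes)  # [0, 1, 1, 2]

# Hypothetical sample: two numeric covariates followed by the SNP code.
age, expression = 54.0, 2.31
features = [age, expression, alt_allele_count("AG", ref_allele)]
```

Because the SNP becomes an ordinary number, it can sit next to the other numeric variables in the same feature vector.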

Just download the reference genome and check which base it has at the SNP's position. Yes, a coding scheme like that would be normal.

ADD REPLY • link 9.4 years ago by Devon Ryan 104k
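
Looking up the base in a downloaded reference could look roughly like the sketch below. It assumes the reference is a FASTA file and that the SNP's chromosome and 1-based position are known. The helper names `read_fasta` and `reference_base`, the sequence name `toy_chr`, and the toy sequence are all made up for the example. A real genome is large, so an indexed reader would be more practical than loading everything into memory.

```python
import io


def read_fasta(handle):
    """Parse a FASTA stream into a dict of {sequence_name: sequence}."""
    sequences, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def reference_base(sequences, chrom, pos):
    """Return the reference base at a 1-based position."""
    return sequences[chrom][pos - 1]


# Toy stand-in for a reference genome file (hypothetical content).
toy_fasta = io.StringIO(">toy_chr example\nACGTAC\nGGTCA\n")
genome = read_fasta(toy_fasta)
print(reference_base(genome, "toy_chr", 9))  # T
```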

score 0 · Answer 1 · 2014-11-30

9.4 years ago

Devon Ryan 104k

Typically, one treats the reference allele as level 0 and then numbers the alternative alleles 1, 2, etc. for the regression. If the data is phased, then you'd give haplotypes a phase as appropriate.

ADD COMMENT • link 9.4 years ago by Devon Ryan 104k
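
One possible reading of the numbering in this answer, written as a small Python sketch. The function name `allele_indices` and the `/` (unphased) and `|` (phased) separators are assumptions chosen for the example. The answer itself does not prescribe an input format.

```python
def allele_indices(genotype, ref, alts):
    """Map each allele to 0 (reference) or 1, 2, ... (alternatives).

    Returns (indices, phased). Phased input ("A|G") keeps its order, since
    the order identifies the haplotype. Unphased input ("A/G") is sorted so
    that A/G and G/A get the same encoding.
    """
    levels = {ref: 0}
    for i, alt in enumerate(alts, start=1):
        levels[alt] = i
    phased = "|" in genotype
    alleles = genotype.split("|" if phased else "/")
    indices = [levels[a] for a in alleles]
    if not phased:
        indices.sort()
    return tuple(indices), phased


# Hypothetical SNP with reference "A" and two alternative alleles.
print(allele_indices("G/A", "A", ["G", "T"]))  # ((0, 1), False)
print(allele_indices("T|A", "A", ["G", "T"]))  # ((2, 0), True)
```

With more than one alternative allele, the indices are labels rather than dosages, so a regression would probably treat them as categorical levels or expand them into one count per alternative allele.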