SNP representation (going beyond Plink or GenABEL)
9.4 years ago
alex.crimi • 0

Hi,

I'd like to use some SNP data from the TCGA dataset. How can I merge the SNP information with other data if I want to implement my own regression script (e.g. in R or Python)? I cannot use Plink or the GenABEL library in R because I need to merge different types of data.

Namely, assume I have a set of SNPs (given as the bases A, T, G and C) and other numeric variables for the same samples, and I want to put them into a single feature vector. How can I represent the SNPs?

Would it make sense to convert the letters into numbers (e.g. A=1, T=2, ...)?

Or are there better ways to combine SNPs and numeric data in one feature vector?

representation SNP • 2.5k views

Thank you. My problem was how to define the "reference allele". I cannot use the most frequent allele in the dataset, first because the selection of the population is biased (the TCGA dataset for a tumor), and second because I don't think this is a good idea in general.

Is there a way to find out the reference allele for a specific SNP (assuming it is not given in my data)?

If I look up a specific SNP on the NCBI page, e.g. http://www.ncbi.nlm.nih.gov/snp/?term=Rs2308327, how can I tell which allele is the reference?

Using this example: GCCCATGAAGGCCACCGGTTGGGGA[A/G]GCCAGGCTTGGGAGGGAGCTCAGGT

is "A" the reference?

Then I guess I can code it as you said: 0 if my data has two copies of the reference allele (AA), 1 if it has one copy (AG or GA), and 2 if both alleles differ from the reference (GG), right?


Just download the reference genome and see what the appropriate base is. Yes, a coding scheme like that would be normal.
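To make the 0/1/2 scheme concrete, here is a minimal Python sketch. It assumes diploid genotypes stored as two-letter strings and a reference allele already known for each SNP; the function names, the sample genotypes and the two numeric covariates are made up for illustration.

```python
def count_alt_alleles(genotype: str, ref: str) -> int:
    """Count the alleles in a diploid genotype that differ from the reference.

    Assumes the genotype is a two-letter string such as "AG".
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return sum(1 for allele in genotype.upper() if allele != ref.upper())


def feature_vector(genotypes, ref_alleles, covariates):
    """Concatenate coded SNPs and numeric covariates into one flat list."""
    snp_part = [
        float(count_alt_alleles(g, r)) for g, r in zip(genotypes, ref_alleles)
    ]
    return snp_part + [float(x) for x in covariates]


# Hypothetical sample: three SNPs (reference "A" for each) plus two
# numeric variables measured on the same sample.
sample = feature_vector(["AA", "GA", "GG"], ["A", "A", "A"], [54.0, 1.2])
print(sample)  # [0.0, 1.0, 2.0, 54.0, 1.2]
```

Coding by allele count keeps AG and GA identical, which an arbitrary letter-to-number mapping (A=1, T=2, ...) would not.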

9.4 years ago

Typically, one treats the reference allele as level 0 and then numbers the alternative alleles 1, 2, etc. for the regression. If the data is phased, then you'd give haplotypes a phase as appropriate.
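As a rough sketch of that numbering in Python: the reference gets level 0 and each alternative allele the next integer. The "|" (phased) and "/" (unphased) separators follow the VCF convention, which is an assumption about how the genotypes are written; the function names are hypothetical.

```python
def allele_levels(ref, alts):
    """Map the reference allele to 0 and the alternatives to 1, 2, ..."""
    levels = {ref: 0}
    for i, alt in enumerate(alts, start=1):
        levels[alt] = i
    return levels


def encode_genotype(genotype, levels):
    """Return (allele codes, phased flag) for a genotype like "G|A" or "T/A".

    For phased genotypes the order of the two haplotypes is kept.
    For unphased genotypes the codes are sorted, since order carries no meaning.
    """
    phased = "|" in genotype
    sep = "|" if phased else "/"
    codes = [levels[allele] for allele in genotype.split(sep)]
    if not phased:
        codes.sort()
    return tuple(codes), phased


levels = allele_levels("A", ["G", "T"])
print(encode_genotype("G|A", levels))  # ((1, 0), True)
print(encode_genotype("T/A", levels))  # ((0, 2), False)
```

With more than one alternative allele these levels are labels rather than quantities, so in a regression they would normally be treated as a categorical factor and not as a single numeric column.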
