Question

feature extraction from SNPs in R

0

6.7 years ago

bioinfo456 ▴ 150

I have a list of SNPs that are associated with Alzheimer's disease. I want to know which features could be extracted from those SNPs to build a classification model in R. Please help.

R gene sequence SNP • 2.8k views

ADD COMMENT • link 6.7 years ago by bioinfo456 ▴ 150

0

Do you have any survival data?

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

I've been told by my mentor that the SNPs with a P value greater than 0.8 are survival data (or that they do not affect gene expression in Alzheimer's) in the following data set: IGAP. How do I go about it from here?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

0

I think you need to talk this through with your mentor. P value > 0.8 means absolutely nothing at all. In a previous thread you were also talking about this and you are probably talking about r^2 > 0.8, which is a measurement of correlation, probably linkage disequilibrium.
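To make the distinction concrete, here is a minimal sketch of r^2 as a correlation between two SNPs. The genotype vectors are made up for illustration, and using the squared Pearson correlation of allele counts is an assumption: it is a common stand-in for the LD r^2 when haplotype phase is unknown, not necessarily how any particular tool computes it.

```r
# Hypothetical genotypes for 8 individuals at two SNPs,
# coded as minor-allele counts (0, 1 or 2).
snp_a <- c(0, 1, 2, 1, 0, 2, 1, 0)
snp_b <- c(0, 1, 2, 1, 0, 2, 0, 0)

# Squared Pearson correlation of the allele counts:
# a genotype-based approximation of the LD r^2 between the two SNPs.
r2 <- cor(snp_a, snp_b)^2

# An r^2 cutoff such as 0.8 asks "do these two SNPs carry nearly the
# same information?"; it says nothing about association with disease.
in_high_ld <- r2 > 0.8
print(r2)
print(in_high_ld)
```

A P value, by contrast, comes from testing each SNP against the phenotype, so a threshold of 0.8 on one quantity cannot be swapped for a threshold of 0.8 on the other.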

ADD REPLY • link 6.7 years ago by WouterDeCoster 48k

0

I'll sort that out with my mentor soon. Back to my primary question: what are the possible features that could be extracted from those SNPs to build a classification model in R? Do I need labeled (supervised) data to start with?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

0

First, you need a proper classification task, which means at least two classes to sort items into. You need to know what your items are and what your classes are. Then you can decide what your features could be. You will need a training set and a test set, which means you need data associating each item with its class, as well as additional data to use as feature vectors.

After that you can try out different classifiers, feature selection, etc. Example:

Items: patients
Classes: affected, unaffected
Features: genotypes of SNPs for each patient
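The example setup above could be sketched in base R roughly as follows. Everything here is simulated and hypothetical (the patient count, the SNP names `snp1` to `snp5`, the effect size, the 150/50 split, and the choice of logistic regression); it only illustrates the shape of the data a classifier needs, not a recommended analysis for real GWAS data.

```r
set.seed(1)

n_patients <- 200
n_snps <- 5

# Items: patients. Features: one genotype per SNP per patient,
# coded as minor-allele counts (0, 1 or 2). Simulated data.
geno <- matrix(
  rbinom(n_patients * n_snps, size = 2, prob = 0.3),
  nrow = n_patients,
  dimnames = list(NULL, paste0("snp", seq_len(n_snps)))
)

# Classes: affected (1) or unaffected (0). In this simulation only
# snp1 changes the odds of being affected.
affected <- rbinom(n_patients, size = 1,
                   prob = plogis(-1 + 1.2 * geno[, "snp1"]))
dat <- data.frame(affected = affected, geno)

# Training and test sets: both need class labels and feature vectors.
train_idx <- sample(n_patients, size = 150)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# One possible classifier: logistic regression on all SNP genotypes.
fit <- glm(affected ~ ., data = train, family = binomial)
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
accuracy <- mean(pred == test$affected)
print(accuracy)
```

Note that in this layout the SNPs are the features and the patients are the items being classified, which requires individual-level genotypes rather than a list of SNPs with summary P values.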

ADD REPLY • link 6.7 years ago by Michael 56k

0

The input to my classification model should be an rs ID, and the model should look for certain features (that I train it with) and accordingly decide whether the input rs ID falls within the class of affecting gene expression in Alzheimer's disease or not. Does that make sense?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

3

Not really. Possibly you are looking for something like eQTL analysis, but I think you should first force your supervisor to be clearer about the project setup; otherwise it's just a waste of time spent on guesswork. Also, if you already have p-values, these are most likely already the best way to estimate association. It makes very limited sense to try to learn the p-value distribution with a classifier.

ADD REPLY • link 6.7 years ago by Michael 56k

0

That is exactly what I was told by my mentor regarding the P value. I was told to consider P value < 0.01 to be disease class and P value > 0.8 to be normal. I was told to use P value only to define class and not as a feature to build the classifier.

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

0

Please consider that your mentor might have an inadequate grasp of the problem as well. As Ram and Wouter have said earlier, these p-value cutoffs are inadequate and especially the 0.8 limit is completely arbitrary.

ADD REPLY • link 6.7 years ago by Michael 56k

0

Alright. So is it OK to use the SNPs mentioned in this Alzheimer's experiment paper as the disease influencer class?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

1

That paper describes loci associated with Alzheimer's in European genomes. Please make sure you're on top of all the caveats that go with taking that as a truth set. Any statement you make can only claim association, not causality. I say this because the word "influence" suggests causality.

ADD REPLY • link 6.7 years ago by Ram 45k

1

You should also make sure you are familiar with things like linkage disequilibrium. SNPs reported in that paper most likely are not the functional variant underlying disease, e.g. the ABCA7 SNP is best explained by expansions of an intronic VNTR.

ADD REPLY • link 6.7 years ago by WouterDeCoster 48k

0

Alright, thanks to all of you for sharing your knowledge. Now, assuming that I have two classes of SNPs, are there any features that I can compute for them?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

0

Can you first elaborate on your biological/genetic background? Do you know about things such as an association analysis, linkage disequilibrium and variant pathogenicity prediction?

ADD REPLY • link 6.7 years ago by WouterDeCoster 48k

0

No. My area of specialisation is machine learning. I mostly work on problems that suffer from the curse of dimensionality. I've developed an interest in seeing the impact of machine learning on biological data, hence I'm working on such projects. I'm very much open to learning. Are the above-mentioned topics what I've been looking for?

ADD REPLY • link 6.7 years ago by bioinfo456 ▴ 150

4

Those topics, among others, are very relevant to your problem. I'd encourage you to first do some research/read some literature, rather than blindly trying to solve that problem by throwing some machine learning at it. How can you select features if you do not understand them? You are likely to get an incorrect model without careful feature selection.

ADD REPLY • link 6.7 years ago by WouterDeCoster 48k