Question: feature extraction from SNPs in R
15 months ago by
Indian Institute of Technology, Madras, India
Uday Rangaswamy120 wrote:

I have a list of SNPs that are associated with Alzheimer's disease. What features could be extracted from those SNPs to build a classification model in R? Please help.

snp sequence R gene • 508 views
modified 15 months ago • written 15 months ago by Uday Rangaswamy120

Do you have any survival data?

written 15 months ago by cpad011212k

My mentor told me that the SNPs with a P value greater than 0.8 are survival data (or that they do not affect gene expression in Alzheimer's) in the following data set: IGAP. How do I go about it from here?

modified 15 months ago • written 15 months ago by Uday Rangaswamy120

I think you need to talk this through with your mentor. P value > 0.8 means absolutely nothing at all. You were also talking about this in a previous thread, and you probably mean r^2 > 0.8, which is a measure of correlation, probably linkage disequilibrium.
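
To make the r^2 idea concrete, here is a minimal sketch in base R. It is not from the thread: the genotype vectors and the helper name `ld_r2` are invented for illustration, and the squared Pearson correlation of allele counts is used as a common approximation of LD r^2 when haplotype phase is unknown.

```r
# Hypothetical genotypes for 8 individuals, coded as minor-allele counts (0, 1, 2).
# These values are invented for illustration only.
snp_a <- c(0, 1, 2, 1, 0, 2, 1, 0)
snp_b <- c(0, 1, 2, 1, 0, 2, 0, 0)  # differs from snp_a in one individual
snp_c <- c(2, 0, 1, 0, 1, 1, 2, 0)  # essentially unrelated to snp_a

# Hypothetical helper: squared Pearson correlation of genotype dosages,
# an approximation of LD r^2 for unphased genotype data.
ld_r2 <- function(x, y) cor(x, y)^2

r2_ab <- ld_r2(snp_a, snp_b)  # high: the two SNPs carry nearly the same information
r2_ac <- ld_r2(snp_a, snp_c)  # close to zero: the two SNPs are nearly independent
```

A pair of SNPs with r^2 above a cutoff such as 0.8 is usually treated as tagging the same signal, which is a different statement from anything a P value says.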

written 15 months ago by WouterDeCoster42k

I'll sort that out with my mentor soon. Back to my primary question: what features could be extracted from those SNPs to build a classification model in R? Do I need labelled (supervised) data to start with?

modified 15 months ago • written 15 months ago by Uday Rangaswamy120

First, you need a proper classification task, which means at least two classes to sort items into. You need to know what your items are and what your classes are. Then you can decide what your features could be. You will need a training set and a test set, which means you need data associating each item with a class, as well as additional data to use as feature vectors.

After that you can try out different classifiers, feature selection, etc. Example:

  • Items: patients
  • Classes: affected, unaffected
  • Features: genotypes of SNPs for each patient
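
The setup above could be sketched in base R roughly as follows. Everything here is an assumption made for illustration and not part of the thread: the genotypes and disease status are simulated, the SNP names (`snp1` to `snp5`) are hypothetical, and logistic regression via `glm` is just one simple baseline classifier.

```r
set.seed(1)  # make the simulated example reproducible

n_patients <- 200
n_snps <- 5

# Simulated genotype matrix: rows are patients (items), columns are SNPs (features),
# entries are minor-allele counts (0, 1, 2). Purely invented data.
geno <- matrix(rbinom(n_patients * n_snps, size = 2, prob = 0.3),
               nrow = n_patients,
               dimnames = list(NULL, paste0("snp", 1:n_snps)))

# Simulated class labels (1 = affected, 0 = unaffected);
# by construction only snp1 changes the odds of being affected.
status <- rbinom(n_patients, size = 1,
                 prob = plogis(-1 + 1.2 * geno[, "snp1"]))

dat <- data.frame(status = status, geno)

# Hold out 30% of the patients as a test set.
n_test <- round(0.3 * n_patients)
test_idx <- sample(n_patients, size = n_test)
train <- dat[-test_idx, ]
test <- dat[test_idx, ]

# Fit the baseline classifier on the training patients only.
fit <- glm(status ~ ., data = train, family = binomial)

# Predicted probability of being affected, thresholded at 0.5.
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
accuracy <- mean(pred == test$status)
```

With real data, `geno` would come from per-patient genotype calls and `status` from clinical diagnosis. A list of rs IDs with summary P values does not contain either of these.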
modified 15 months ago • written 15 months ago by Michael Dondrup47k

The input to my classification model should be an rs ID. The model should look for certain features (that I train it with) and decide accordingly whether the input rs ID falls within the class that affects gene expression in Alzheimer's disease or not. Does that make sense?

written 15 months ago by Uday Rangaswamy120

Not really. Possibly you are looking for something like eQTL analysis, but I think you should first push your supervisor to be clearer about the project setup; otherwise it's just a waste of time through guesswork. Also, if you already have p-values, these are most likely already the best estimate of association. It makes very limited sense to try to learn the p-value distribution with a classifier.

modified 15 months ago • written 15 months ago by Michael Dondrup47k

That is exactly what I was told by my mentor regarding the P value. I was told to consider P value < 0.01 to be disease class and P value > 0.8 to be normal. I was told to use P value only to define class and not as a feature to build the classifier.

written 15 months ago by Uday Rangaswamy120

Please consider that your mentor might have an inadequate grasp of the problem as well. As Ram and Wouter have said earlier, these p-value cutoffs are inadequate and especially the 0.8 limit is completely arbitrary.

written 15 months ago by Michael Dondrup47k

Alright. So is it alright to use the SNPs mentioned in this Alzheimer's experiment paper as disease influencer class?

written 15 months ago by Uday Rangaswamy120

That paper describes loci associated with Alzheimer's in European genomes. Please make sure you're on top of all the caveats that go with taking that as a truth set. Any statement you make can only claim association, not causality. I say this because the word "influence" suggests causality.

written 15 months ago by RamRS25k

You should also make sure you are familiar with things like linkage disequilibrium. SNPs reported in that paper most likely are not the functional variant underlying disease, e.g. the ABCA7 SNP is best explained by expansions of an intronic VNTR.

written 15 months ago by WouterDeCoster42k

Alright, thank you all for sharing your knowledge. Now, assuming that I have two classes of SNPs, are there any features that I can compute for them?

written 15 months ago by Uday Rangaswamy120

Can you first elaborate on your biological/genetic background? Do you know about things such as association analysis, linkage disequilibrium and variant pathogenicity prediction?

written 15 months ago by WouterDeCoster42k

No. My area of specialisation is machine learning. I mostly work on problems that suffer from the curse of dimensionality. I've developed an interest in the impact of machine learning on biological data, which is why I'm working on such projects. I'm very much open to learning. Are the above-mentioned topics what I've been looking for?

written 15 months ago by Uday Rangaswamy120

Those topics, among others, are very relevant to your problem. I'd encourage you to first do some research/read some literature, rather than blindly trying to solve that problem by throwing some machine learning at it. How can you select features if you do not understand them? You are likely to get an incorrect model without careful feature selection.

written 15 months ago by WouterDeCoster42k