Hello,

I am using the TruSight Tumor 170 kit from Illumina. I have sequenced 10 benign tumor samples and 10 malignant tumor samples. I would like to create a classifier based on SNPs and indels (what Illumina calls small variants). I already have some SNPs and indels that occur only in the malignant sample set and others that occur only in the benign sample set. Let's say I have this dataset:

tumor_malign <- c(1,0,0,1...)
snp1 <- c(0,1,1,0...)
snp2 <- c(1,1,1,0...)
snp3 <- c(0,0,1,1...)
snpN <- c(0,1,1,1...)

I was thinking to do a logistic regression:

glm(tumor_malign ~ snp1 + snp2 + snp3 + ... + snpN, family=binomial(link=logit))

but I am not sure this is a good approach, because I think I could just use a few "if" statements to get a diagnosis, for instance:

if snp1==1 and snp2==1...: has_tumor = T else: has_tumor = F

On the other hand, if I base my methodology only on variants (in the end, they are categorical variables), then when a new sample comes in that doesn't have any of the variants, I would not be able to give a diagnosis.
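To make the comparison concrete, here is a minimal sketch of the logistic regression idea on simulated data. Everything in it is an assumption for illustration only: the three variants, the probabilities used to simulate them, and the `dat` and `new_sample` names are made up and are not real TruSight results. It shows that a fitted model still returns a probability for a sample carrying none of the variants (it comes from the intercept), although with only 20 samples that estimate would be very uncertain.

```r
# Toy data: 20 samples (10 malignant, 10 benign) and 3 hypothetical variants.
set.seed(1)
n <- 20
tumor_malign <- rep(c(1, 0), each = n / 2)

# Presence/absence (1/0) of each variant; the probabilities are invented.
snp1 <- rbinom(n, 1, ifelse(tumor_malign == 1, 0.7, 0.2))
snp2 <- rbinom(n, 1, ifelse(tumor_malign == 1, 0.6, 0.3))
snp3 <- rbinom(n, 1, 0.5)
dat <- data.frame(tumor_malign, snp1, snp2, snp3)

# Logistic regression; glm may warn about separation on such a small set.
fit <- suppressWarnings(
  glm(tumor_malign ~ snp1 + snp2 + snp3, data = dat,
      family = binomial(link = "logit"))
)

# A new sample with none of the variants still gets a predicted probability.
new_sample <- data.frame(snp1 = 0, snp2 = 0, snp3 = 0)
p <- suppressWarnings(predict(fit, newdata = new_sample, type = "response"))
print(p)
```

A hard "if" rule would simply return FALSE for this sample, whereas the model returns a number between 0 and 1 that you can threshold or report as is.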

What do you think of this approach? Is there any paper describing something similar? Would you suggest anything different?

Thanks a lot

Well, first of all, welcome to the machine learning field. For your research you need to build a model (logistic regression, random forest, support vector machine, naive Bayes, etc.) and find the classifier that separates benign from malignant tumors most accurately. Once the model is built, you can use it to predict the class of a new sample. But you should know how machine learning works first; you can google it, there are plenty of documents, videos and tutorials. I really like this Nature review. After you learn the basics of machine learning, you can also read this.
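As a rough illustration of what "find the best classifier" can mean in practice, here is a sketch that estimates accuracy with leave-one-out cross-validation, using only base R. The data are simulated and the variant names and probabilities are assumptions, not real results. The same loop could be repeated with another model to compare them.

```r
# Simulated data: 20 samples, 2 hypothetical variants (invented probabilities).
set.seed(42)
n <- 20
dat <- data.frame(tumor_malign = rep(c(1, 0), each = n / 2))
dat$snp1 <- rbinom(n, 1, ifelse(dat$tumor_malign == 1, 0.7, 0.2))
dat$snp2 <- rbinom(n, 1, ifelse(dat$tumor_malign == 1, 0.6, 0.3))

# Leave-one-out: fit on 19 samples, predict the one that was held out.
pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- suppressWarnings(
    glm(tumor_malign ~ snp1 + snp2, data = dat[-i, ], family = binomial)
  )
  pred[i] <- suppressWarnings(
    predict(fit, newdata = dat[i, ], type = "response")
  )
}

# Call a sample malignant when the predicted probability exceeds 0.5.
accuracy <- mean((pred > 0.5) == (dat$tumor_malign == 1))
print(accuracy)
```

With 10 samples per class the accuracy estimate will be noisy, so treat it as a sanity check rather than a final performance figure.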