I am using the TruSight Tumor 170 kit from Illumina. I have sequenced 10 benign tumor samples and 10 malignant tumor samples. I would like to build a classifier based on SNPs and indels (what Illumina calls small variants). I have already found some SNPs and indels that occur only in the malignant sample set and others that occur only in the benign sample set. Let's say I have this dataset:
tumor_malign <- c(1,0,0,1...)
snp1 <- c(0,1,1,0...)
snp2 <- c(1,1,1,0...)
snp3 <- c(0,0,1,1...)
snpN <- c(0,1,1,1...)
I was thinking of fitting a logistic regression:
glm(tumor_malign ~ snp1 + snp2 + snp3 + ... + snpN, family=binomial(link=logit))
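To make the idea concrete, here is a minimal sketch on invented toy data. The genotype values are made up (they are not my real calls), only three SNPs are shown, and the 0.5 cutoff is an arbitrary assumption. The toy SNPs deliberately overlap between the two groups; with variants that occur in only one group, I would expect glm() to warn about fitted probabilities of 0 or 1 (perfect separation).

```r
# Toy data, invented for illustration: 10 benign (0) then 10 malignant (1) samples
tumor_malign <- rep(c(0, 1), each = 10)

# Hypothetical presence (1) / absence (0) calls for three SNPs
snp1 <- c(0,1,0,0,1,0,1,0,1,0,  1,1,0,1,1,1,0,1,1,1)
snp2 <- c(1,0,0,1,0,0,1,0,0,0,  1,0,1,1,0,1,1,0,1,1)
snp3 <- c(0,0,1,0,1,1,0,1,0,0,  1,0,0,1,1,0,1,0,1,0)

toy <- data.frame(tumor_malign, snp1, snp2, snp3)

# Logistic regression of malignancy on the SNP indicators
fit <- glm(tumor_malign ~ snp1 + snp2 + snp3,
           data = toy, family = binomial(link = "logit"))

# Fitted probability of malignancy for each training sample
prob <- predict(fit, type = "response")

# Turn probabilities into calls with an arbitrary 0.5 cutoff (assumption)
pred <- as.integer(prob > 0.5)

# Confusion table on the training data (optimistic, since there is no held-out set)
print(table(observed = toy$tumor_malign, predicted = pred))
```

With only 20 samples, the table above is computed on the same data used for fitting, so it says little about how the model would do on a new sample.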
but I am not sure whether this is a good approach, because I think I could simply use a few "if" rules to reach a diagnosis, for instance:
if (snp1 == 1 && snp2 == 1 ...) has_tumor <- TRUE else has_tumor <- FALSE
On the other hand, if I base the method only on these variants (in the end, they are categorical variables), then when a new sample arrives that carries none of them, I would not be able to give a diagnosis.
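A rough sketch of what such a rule could look like, including the "no diagnosis" case. The helper name classify_by_rule, its arguments, and the "any variant present" logic are all hypothetical choices for illustration, not something I have settled on:

```r
# Hypothetical helper: classify one sample from its 0/1 variant calls.
#   malign_only: calls for variants seen only in the malignant set
#   benign_only: calls for variants seen only in the benign set
classify_by_rule <- function(malign_only, benign_only) {
  has_m <- any(malign_only == 1)
  has_b <- any(benign_only == 1)
  if (has_m && !has_b) return("malignant")
  if (has_b && !has_m) return("benign")
  # Sample carries none of the variants, or variants from both sets: no call
  NA_character_
}

classify_by_rule(c(1, 0), c(0, 0))  # malignant-only variant present
classify_by_rule(c(0, 0), c(0, 1))  # benign-only variant present
classify_by_rule(c(0, 0), c(0, 0))  # no known variant: returns NA
```

The last call is exactly the situation that worries me: a new sample with none of the known variants gets no diagnosis at all.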
What do you think of this approach? Is there any paper describing something similar? Would you suggest anything different?
Thanks a lot