Regression using genotype on expression of genes
3.1 years ago
wstla27 ▴ 20

I am a total beginner in bioinformatics, so this question might be very trivial. Right now, I need to run a linear regression using genotype information on the expression of some genes. I have VCF files for all the chromosomes. I am having a hard time understanding how I should feed the genotype information (0s and 1s) to the regression model. Do I use the allele frequency, or should I just use the 0s and 1s? Also, regarding the expression of genes, I have a list of the gene IDs, their related snp_ids, r-values, and p-values. In order to feed this into the linear model, what kind of expression value should I use?

(I am having a hard time understanding this because in all the stats courses, we simply use values and numbers. But for the biological information, there are only 0s and 1s. I can't seem to figure out how to do a regression on 0s and 1s and find their association.)

Thank you so much for your help!

SNP gene linear regression • 544 views
3.1 years ago

I am not sure what you are aiming to do, exactly. However, you should first get your VCF data into an 'analysis-ready' format. This will involve either summarising each genotype as an alt-allele tally (0, 1, or 2, treated as numeric) or keeping it as a categorical variable (Homozygous Ref, Heterozygous, and Homozygous Alt).
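As a rough illustration of that conversion, here is a minimal base-R sketch. It assumes biallelic sites and that the GT strings have already been pulled out of the VCF sample columns; the example vector `gt` and the helper `gt_to_dosage()` are hypothetical names made up for this sketch, not part of any package.

```r
# Hypothetical GT strings as they appear in a VCF sample column (biallelic site)
gt <- c("0/0", "0/1", "1|0", "1/1", "./.")

# Count ALT alleles per sample (0, 1, or 2); missing calls become NA.
# Assumption: any allele other than "0" is counted as ALT.
gt_to_dosage <- function(gt) {
  alleles <- strsplit(gt, "[/|]")  # split on unphased "/" or phased "|"
  vapply(alleles, function(a) {
    if (any(a == ".")) NA_integer_ else sum(a != "0")
  }, integer(1))
}

dosage <- gt_to_dosage(gt)

# The same information as a categorical variable
genotype <- factor(dosage, levels = 0:2, labels = c("Ref", "Het", "HomAlt"))
```

The numeric `dosage` corresponds to the allele-tally option, and the factor `genotype` to the categorical option.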

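After that, you can do a logistic regression (genotype as the outcome) or a linear regression (expression as the outcome):

glm(Variant ~ GeneExpression, data = mydata, family = binomial(link = 'logit')) # binary logistic regression; Variant must have 2 levels (for 3 genotype classes, use a multinomial model, e.g. nnet::multinom)
lm(GeneExpression ~ Variant, data = mydata) # linear regression; Variant as a 0/1/2 alt-allele count or as a factor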
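To show that regressing on 0/1/2 values works like any other numeric predictor, here is a small sketch on simulated data. Everything in it is made up for illustration: the sample size, the allele frequency of 0.3, and the effect of 0.5 per allele are assumptions, and in a real analysis `GeneExpression` would be a normalised expression value per sample rather than random numbers.

```r
set.seed(1)  # simulated data only, for illustration
n <- 200
dosage <- rbinom(n, size = 2, prob = 0.3)  # ALT-allele count per sample (0/1/2)
expr <- 5 + 0.5 * dosage + rnorm(n)        # expression with an assumed effect of 0.5 per allele
mydata <- data.frame(GeneExpression = expr, Variant = dosage)

# Additive model: one slope, the change in expression per extra ALT allele
fit_additive <- lm(GeneExpression ~ Variant, data = mydata)
print(summary(fit_additive)$coefficients)

# Genotypic model: the same data with genotype as a 3-level factor
mydata$Genotype <- factor(mydata$Variant, levels = 0:2,
                          labels = c("Ref", "Het", "HomAlt"))
fit_genotypic <- lm(GeneExpression ~ Genotype, data = mydata)
print(anova(fit_genotypic))
```

The additive model estimates a single per-allele effect, while the factor version estimates separate Het and HomAlt effects relative to Ref; which one is appropriate depends on the question being asked.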


Kevin