Using Linear Regression on Genotype and Expression data
Hi all,

I have studied many sources, like this and this, that try to relate the expression of a gene to its variants (SNPs), but all of them leave one question unanswered. There are 3 genotype classes: "0" for 0 minor allele copies (ref/ref), "1" for 1 minor allele copy (ref/alt), and "2" for 2 minor allele copies (alt/alt). If we consider only SNPs within 100 kbp upstream and downstream of the TSS (transcription start site), we may have about 20 SNPs per gene, so there will be a lot of collinearity among the independent variables (the genotypes).

This is a sample of the table on which I will run linear regression (the "lm" function in R):

          SNP1    SNP2    SNP3    SNP4    ...    Gene expression
donor1    0       1       0       1              3.5
donor2    0       1       0       1              4.5
donor3    0       0       0       0              3.0
donor4    1       1       0       1              5.5
donor5    0       1       0       1              1.5
...


I have ~400 donors, and many of them, like donor1 and donor5, have identical genotypes across the SNPs. When I fit the linear regression and predict from it, this warning arises: "prediction from a rank-deficient fit may be misleading".

So what should I do? Am I doing something wrong?

Thanks a lot.

Can you show the model that you are fitting?

I am doing this:

model <- lm(gene_expression ~ ., data = my_data_train)
pred_lm <- predict(model, newdata = my_data_test)

6 weeks ago
PeterKW

This warning most likely arises because some of your covariates are collinear, e.g. SNP2 and SNP4 in the sample table you gave, which have identical genotypes for every donor shown. There are various other possible reasons given here. I hope this helps; give the different answers some careful thought.
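
To check this on your own data, here is a minimal sketch in base R. The toy data frame is made up for illustration (it only mimics your sample table, with SNP4 an exact copy of SNP2 and SNP3 constant), and the names `aliased`, `reduced` and `model_reduced` are my own. The idea is that lm() returns NA for any coefficient it cannot estimate because that column is a linear combination of earlier ones, so those NA names tell you which SNPs to look at.

```r
# Hypothetical toy data: SNP4 duplicates SNP2, and SNP3 is 0 for every donor
my_data_train <- data.frame(
  SNP1 = c(0, 0, 0, 1, 0, 1, 0, 2),
  SNP2 = c(1, 1, 0, 1, 1, 0, 2, 1),
  SNP3 = c(0, 0, 0, 0, 0, 0, 0, 0),
  SNP4 = c(1, 1, 0, 1, 1, 0, 2, 1),
  gene_expression = c(3.5, 4.5, 3.0, 5.5, 1.5, 4.0, 2.5, 6.0)
)

model <- lm(gene_expression ~ ., data = my_data_train)

# Coefficients that could not be estimated are reported as NA
aliased <- names(coef(model))[is.na(coef(model))]
print(aliased)

# Compare the rank of the fit with the number of coefficients requested
cat("rank:", model$rank, "of", length(coef(model)), "coefficients\n")

# One simple remedy (an assumption about what you want, not the only option):
# drop the redundant columns and refit, so the fit is full rank
reduced <- my_data_train[, !(names(my_data_train) %in% aliased), drop = FALSE]
model_reduced <- lm(gene_expression ~ ., data = reduced)
print(coef(model_reduced))
```

Dropping a column this way does not say which of two identical SNPs is the causal one; it only removes the redundancy so that predict() no longer works from a rank-deficient fit. Calling alias(model) should report the same dependencies in more detail.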