Linear regression on RNA-seq data
2
0
Entering edit mode
2.2 years ago
JACKY ▴ 140

Hey, I have RNA-seq data from patients who were infected with a virus . The transcriptomic data is collected from day 0 (pefore infection) till day 14, two weeks after infection. of course I have RNA-seq expression matrix with samples in columns and genes in rows. I also have the day of each sample, is it taken from day 0, day 1 and so on...

My goal is to use linear regression to predict the day, according to the data I have. To make a model that the independant variables are the genes, and the outcome (dependat variable) is the day number.

Is it possible to use linear regression for this task ? if yes, can I get a hint of how to do it in R ? if not, what should I use then ? it needs to be supervised machine learning..

Thank you.

r regression • 1.8k views
ADD COMMENT
0
Entering edit mode

You should consult a statistician about your data. No one ML model will fit in all cases, and there are certain precautions you need to take when building a predictive model.

ADD REPLY
0
Entering edit mode

This is a small sample of the RNA-seq counts data:

structure(list(ERR3240275 = c(1L, 0L, 59L, 35L, 14L, 67L, 0L, 
7L, 46L, 90L, 7L, 27L, 240L, 15L, 0L, 0L, 117L, 124L, 28L, 41L
), ERR3240276 = c(3L, 0L, 66L, 33L, 15L, 73L, 0L, 5L, 48L, 64L, 
1L, 35L, 209L, 7L, 0L, 0L, 117L, 128L, 28L, 45L), ERR3240277 = c(1L, 
0L, 54L, 35L, 14L, 81L, 0L, 5L, 49L, 97L, 6L, 28L, 198L, 14L, 
0L, 1L, 112L, 120L, 27L, 43L), ERR3240278 = c(0L, 0L, 39L, 33L, 
11L, 69L, 0L, 6L, 52L, 77L, 10L, 43L, 244L, 15L, 0L, 0L, 127L, 
140L, 25L, 34L), ERR3240279 = c(0L, 0L, 56L, 21L, 15L, 71L, 0L, 
3L, 41L, 70L, 5L, 32L, 231L, 19L, 0L, 0L, 102L, 119L, 25L, 29L
), ERR3240280 = c(1L, 0L, 41L, 24L, 4L, 44L, 0L, 7L, 38L, 64L, 
1L, 20L, 140L, 8L, 1L, 0L, 72L, 90L, 26L, 29L), ERR3240281 = c(0L, 
0L, 56L, 34L, 5L, 72L, 0L, 8L, 49L, 81L, 9L, 28L, 231L, 10L, 
0L, 0L, 108L, 141L, 39L, 40L), ERR3240282 = c(3L, 0L, 67L, 25L, 
12L, 75L, 0L, 7L, 39L, 87L, 4L, 27L, 231L, 16L, 0L, 1L, 143L, 
142L, 24L, 42L), ERR3240283 = c(0L, 0L, 60L, 41L, 11L, 71L, 0L, 
11L, 50L, 96L, 6L, 40L, 247L, 10L, 0L, 1L, 128L, 127L, 27L, 29L
), ERR3240284 = c(3L, 0L, 47L, 31L, 9L, 86L, 0L, 11L, 57L, 100L, 
6L, 31L, 207L, 17L, 0L, 0L, 100L, 151L, 33L, 38L), ERR3240285 = c(0L, 
0L, 65L, 34L, 13L, 88L, 0L, 10L, 47L, 79L, 8L, 49L, 228L, 11L, 
0L, 0L, 129L, 160L, 41L, 51L), ERR3240286 = c(1L, 0L, 40L, 27L, 
14L, 70L, 0L, 14L, 53L, 90L, 3L, 40L, 238L, 14L, 0L, 0L, 118L, 
156L, 30L, 34L), ERR3240287 = c(1L, 0L, 67L, 35L, 11L, 85L, 0L, 
11L, 47L, 87L, 2L, 33L, 208L, 11L, 0L, 0L, 128L, 131L, 25L, 40L
), ERR3240288 = c(1L, 0L, 50L, 31L, 18L, 67L, 0L, 11L, 42L, 92L, 
6L, 33L, 224L, 15L, 0L, 0L, 149L, 131L, 30L, 36L), ERR3240289 = c(0L, 
0L, 42L, 34L, 11L, 56L, 0L, 14L, 38L, 98L, 9L, 10L, 205L, 13L, 
0L, 0L, 104L, 148L, 29L, 34L), ERR3240290 = c(2L, 0L, 64L, 37L, 
9L, 63L, 0L, 8L, 44L, 93L, 5L, 43L, 215L, 13L, 0L, 0L, 103L, 
152L, 24L, 24L), ERR3240291 = c(2L, 0L, 43L, 22L, 13L, 90L, 0L, 
19L, 25L, 52L, 1L, 39L, 95L, 11L, 0L, 0L, 125L, 68L, 25L, 18L
), ERR3240292 = c(0L, 0L, 49L, 20L, 20L, 117L, 0L, 20L, 19L, 
60L, 6L, 50L, 73L, 5L, 0L, 0L, 135L, 74L, 25L, 17L), ERR3240293 = c(0L, 
0L, 39L, 15L, 11L, 107L, 0L, 22L, 12L, 55L, 3L, 52L, 86L, 8L, 
0L, 2L, 146L, 64L, 33L, 15L), ERR3240294 = c(0L, 0L, 43L, 27L, 
19L, 78L, 1L, 21L, 25L, 51L, 3L, 51L, 56L, 10L, 0L, 1L, 136L, 
58L, 18L, 19L)), row.names = c("TSPAN6", "TNMD", "DPM1", "SCYL3", 
"C1orf112", "FGR", "CFH", "FUCA2", "GCLC", "NFYA", "STPG1", "NIPAL3", 
"LAS1L", "ENPP4", "SEMA3F", "CFTR", "ANKIB1", "CYP51A1", "KRIT1", 
"RAD52"), class = "data.frame")

The metadata that tells us the day of the sample:

structure(list(`Sample Characteristic[individual]` = c("Donor 1", 
"Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", 
"Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", 
"Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", "Donor 1", 
"Donor 1"), `Sample Characteristic[organism]` = c("Homo sapiens", 
"Homo sapiens", "Homo sapiens", "Homo sapiens", "Homo sapiens", 
"Homo sapiens", "Homo sapiens", "Homo sapiens", "Homo sapiens", 
"Homo sapiens", "Homo sapiens", "Homo sapiens", "Homo sapiens", 
"Homo sapiens", "Homo sapiens", "Homo sapiens", "Homo sapiens", 
"Homo sapiens", "Homo sapiens", "Homo sapiens"), `Factor Value[time]` = c("1 day", 
"1 day", "1 day", "1 day", "1 day", "1 day", "1 day", "1 day", 
"1 day", "1 day", "1 day", "1 day", "1 day", "1 day", "1 day", 
"1 day", "14 day", "14 day", "14 day", "14 day")), row.names = c("ERR3240275", 
"ERR3240276", "ERR3240277", "ERR3240278", "ERR3240279", "ERR3240280", 
"ERR3240281", "ERR3240282", "ERR3240283", "ERR3240284", "ERR3240285", 
"ERR3240286", "ERR3240287", "ERR3240288", "ERR3240289", "ERR3240290", 
"ERR3240291", "ERR3240292", "ERR3240293", "ERR3240294"), class = "data.frame")
ADD REPLY
1
Entering edit mode
2.2 years ago
Mensur Dlakic ★ 27k

It appears that you need to perform feature selection and find a subset of informative genes. If so, you may want to try lasso regression, because the L1-penalty will shrink most of feature coefficients to zero, which will effectively select the features. Specifically, I suggest you do lasso regression with cross-validation.

the outcome (dependant variable) is the day number.

If I understand correctly that you want to fit the data to a number of days that have passed since the infection, that to me makes little sense. This is because I don't expect any trend in data that would be uniformly increasing with each passing day. But even if you found a set of genes that fit to the increasing number of days, what would be the conclusion from it?

ADD COMMENT
0
Entering edit mode

In my understanding the OP doesn't want to select genes. S/he wants to use global gene expression to predict an outcome - it's not clear to me what this outcome or response variable is though. Lasso may work but my gut feeling is that with many more genes than samples it would be better to use principal components.

ADD REPLY
0
Entering edit mode

Yes, you are right, I need to use the genes as independent variables, to predict the day of the sample (dependent variable). Each day has a different expression profile, so I'll use the genes expression for the predictions. Of course I cant use all genes so I'll have to pick the most informative ones, and for that your suggestion is great, I will use lasso regression. However, I can't use lasso for all genes, I'll have to pick a number of genes, then lasso will make it even smaller.

There is no coclusion, I'm doing this for a homework, it is for practice purposes only. The problem is, I don't know how will I apply this in R.

ADD REPLY
0
Entering edit mode
2.2 years ago

I just throw in an idea... Since you probably have many more genes than samples you cannot just regress using all genes as covariates. However, you could reduce the gene expression matrix to the first few first principal components and use those as covariates. How many PCs? You would have to look at the variance explained by each additional component and make a trade-off between number of covariates and amount of information retained. You would hope that with a handful of PCs or anyway much fewer PCs than samples you capture most of the variation. Perhaps some cross validation would help. Also, you may discard upfront genes with very low expression or very small variance across samples.

It's not very clear to me what your response variable is, but using the reduced gene expression matrix you should be able to fit a, I don't know, linear mixed model or time-to-event model.

Since a new observation comes as a vector of gene expressions, you should first project this vector onto the rotated axes from PCA above, get the same number of first components you decided above, and use those as new data for the fitted model to make the prediction.

You may get better answers if you post a small representative sample of your data for people to play with.


it needs to be supervised machine learning

That's an odd requirement... Why does it have to be under the heading of ML?!

ADD COMMENT
0
Entering edit mode

I understand that I can't use all the available genes, I must reduce their number first, but just to make sure I understood what you said, use PCA to identify the informative genes?

The machine learning is a requirement of the lecturer, this is a home project, there isn't much to conclude from such data.. so predicting the day using linear regression makes the mose sense to me.

ADD REPLY

Login before adding your answer.

Traffic: 1820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6