What are more powerful methods for selecting significant genes?
6.0 years ago
Chaimaa ▴ 260

Hello guys,

I'm looking for a method that is more powerful than the classical elastic net and that can select a more significant list of genes.

I already used the elastic net with alpha=0.5 to select a set of significant genes from genomic data, but now I'm looking for a more powerful method that can do the same thing, as suggested by one reviewer. I have read some papers but still can't decide.

I appreciate any help!

gene selection significant biological method R • 2.9k views

Just be careful on the wording...

'significant list of genes' should be 'statistically significant list of genes'. Also, in which way are they statistically significant? What is the null hypothesis? How will you test the null hypothesis and derive a p-value?

How do you define 'more powerful method'?

I would take the variables from the elastic-net model and then test them independently in a standard regression model. Ultimately, I would build a final predictor model of the best variables and test it via ROC analysis, like I do here: A: How to exclude some of breast cancer subtypes just by looking at gene expressio
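As a rough illustration of the last step (final model plus ROC analysis), here is a minimal sketch in base R. The data are simulated and the gene names g1..g5 are hypothetical, not from the poster's dataset; the AUC is computed with the rank-sum identity rather than a dedicated ROC package, which is my own simplification.

```r
# Simulated stand-in for the real data: 219 samples, 5 candidate genes
# (hypothetical names g1..g5); only g1 and g2 carry signal.
set.seed(1)
n <- 219
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("g", 1:5)))
y <- rbinom(n, 1, plogis(1.5 * X[, "g1"] - 1.0 * X[, "g2"]))
dat <- data.frame(y = y, X)

# Final predictor model built from the variables judged best.
final <- glm(y ~ g1 + g2, data = dat, family = binomial(link = "logit"))
prob <- predict(final, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity, base R only.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_final <- auc(prob, dat$y)
print(auc_final)
```

Note that this AUC is computed on the same samples used to fit the model, so it is optimistic; a held-out set or cross-validation would give a fairer estimate.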


@Kevin Blighe Yes, Kevin, you're right: I should say 'statistically significant', but I can't, since I didn't use a p-value < 0.05 criterion.

So I can only say 'specific genes' in my case. I first applied the elastic net, then a cutoff of 20 to determine these specific genes, but one of the reviewers suggested that I use another method, because the elastic net is an old, classical method.

By 'a more powerful method' I mean a somewhat newer method that does the same job and outperforms the elastic net.


Yeah, but the latest methods (assuming s/he means AI, machine learning, etc.) are invariably not better than the 'classical' methods. I would take issue with the reviewer's comments. I have already given you a general workflow that you could try.


@Kevin Blighe Kevin, in my case I have 2 variables, X (a matrix of size m×n) and Y (a vector of size m×1). Those were my inputs to the elastic net, and now I would like to find another method that does the same job with the same inputs.

I checked some papers and still can't decide.


@Kevin Blighe What if I use the elastic net with the glmnet package? Would that make sense? I don't have 'subtype' as you mention in "A: How to exclude some of breast cancer subtypes just by looking at gene expressio"; I have a vector of values 0 and 1 instead. Is that OK?


Yes, you just need 0 or 1 for the outcome variable. Please take a look at all of the parameters for the functions in the glmnet package
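For reference, a minimal sketch of an elastic net with a 0/1 outcome in the R glmnet package might look like the following. The data are simulated and the gene names are hypothetical; alpha = 0.5 is taken from the original post, and using lambda.min to pick the coefficients is my assumption, not something stated in the thread.

```r
library(glmnet)

# Simulated stand-in: 100 samples, 200 hypothetical genes, 0/1 outcome.
set.seed(1)
n <- 100
p <- 200
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("gene", 1:p)))
y <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# Cross-validated elastic net; alpha = 0.5 mixes the ridge and lasso penalties.
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

# Coefficients at the lambda with minimum cross-validated deviance.
b <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```

The genes with non-zero coefficients are the ones the model retains; they are selected variables, not statistically significant ones in the p-value sense.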


@Kevin Blighe Hi Kevin, one last question, please: do you recommend any methods other than elastic net and lasso?


You could try Random Forest.
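A minimal sketch of what that could look like with the randomForest package, on simulated data with hypothetical gene names. The choice of ntree = 500 and of permutation importance (type = 1) for ranking are my assumptions, not recommendations from the thread.

```r
library(randomForest)

# Simulated stand-in: 100 samples, 50 hypothetical genes, 0/1 outcome.
set.seed(1)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("gene", 1:p)))
y <- factor(rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2])))

# A factor outcome makes randomForest() fit a classification forest.
rf <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)

# Rank genes by mean decrease in accuracy (type = 1).
imp <- importance(rf, type = 1)
top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]
print(top)
```

As with the elastic net, an importance ranking is not a p-value, so the top genes would still need to be tested separately.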


Sure, thanks Kevin!


@Kevin Blighe, I have some more questions, and I hope you don't mind! I performed my analysis using the MATLAB glmnet package, and now I have switched to following your process described in "A: Multinomial elastic net implementation on microarray dataset". First of all, my data has 2 labels: a matrix X (219*25172) and a vector Y (219*1), i.e. 219 samples and 25172 genes, and Y has only 2 values, 0 or 1. I first tried to read the X and Y files into R. Y opened properly, but X did not, using this command:

library(data.table)
X <- fread("pathological_data.txt", sep = "\t", stringsAsFactors = FALSE, header = TRUE)

But it returned an object of size 218*25175 instead of 219*25172!

In MATLAB, I found around 303 genes. How can I use glm or lm in this case, and should I use the binomial or gaussian family in the cvglmnet function with alpha=0.5?

What is the meaning of these 2 things in your code?

 family=binomial(link="logit")
Terms=c(2:4)

My data (X) looks like this:

         ELMO2         RPS11     PNMA1     MMP2
sample1  -0.73275      0.89175   -1.59775  -1.6905
sample2  -1.358083333  1.381625  0.24475   -0.837333333
sample3  -0.584        1.09      -1.027    0.147
sample4  0.689         0.952625  -2.223    0.150166667
sample5  -0.795083333  1.06425   -1.15475  -0.015166667
sample6  -1.241916667  1.753125  -1.7555   -1.2375

Hey, the two implementations should not be expected to produce the same results. The likely reason is that there is different filtering between the R and MATLAB versions.

family=binomial(link="logit") - this instructs glm() that the model is a binomial logistic regression

Terms=c(2:4) - this is used with the Wald test to produce a Wald p-value using 1 or more terms combined. For example:

glm(y ~ x1 + x2 + x3 + x4 + x5)

Terms=c(2:4) will test x1 + x2 + x3 (2nd to 4th terms in the model, with the intercept being the 1st term) against the y variable.
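To make the indexing concrete, here is a small sketch that computes the joint Wald test for coefficients 2 to 4 by hand in base R. The data are simulated, and I am assuming the original code passes Terms to a function such as wald.test() from the aod package; the manual calculation below is only meant to show what that argument selects.

```r
# Simulated data with five predictors; only x1 and x2 affect y.
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

model <- glm(y ~ x1 + x2 + x3 + x4 + x5, data = d,
             family = binomial(link = "logit"))

# Coefficient 1 is the intercept, so 2:4 picks x1, x2 and x3.
Terms <- 2:4
b <- coef(model)[Terms]
Sigma <- vcov(model)[Terms, Terms]

# Wald statistic b' Sigma^-1 b; chi-squared with length(Terms) df under H0.
W <- as.numeric(t(b) %*% solve(Sigma) %*% b)
p_value <- pchisq(W, df = length(Terms), lower.tail = FALSE)
print(names(b))
print(c(W = W, p = p_value))
```

The null hypothesis here is that the coefficients of x1, x2 and x3 are all zero at once, which is why a single p-value comes out for the three terms combined.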


@Kevin Blighe Yes Kevin, that's why I have switched to your R code now, but I'm unable to read my data into R. Please check the line I use to open the file: X<-fread("pathological_data.txt",sep="\t", stringsAsFactors=FALSE, header=TRUE)

If I also find about 300 genes using R, can I still use glm() to pick the best predictors? When we have more than 30 genes we have to test them separately, so how can we test 300 genes separately?

If Y contains only 0 or 1, that means it's binomial, right?

Thanks for your valuable suggestions!


Did you look at the contents of X? - did you try to use read.table() instead of fread()?
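One way to inspect the file is to count the fields on each line and then try read.table(). The sketch below writes a tiny file shaped like the sample posted above (a header of gene names, then rows that start with a sample name) to a temporary path; whether the real file has exactly this layout is an assumption.

```r
# Tiny tab-separated file: the header has 4 gene names, each data row has
# a sample name plus 4 values (layout assumed from the posted sample).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "ELMO2\tRPS11\tPNMA1\tMMP2",
  "sample1\t-0.73275\t0.89175\t-1.59775\t-1.6905",
  "sample2\t-1.358083333\t1.381625\t0.24475\t-0.837333333",
  "sample3\t-0.584\t1.09\t-1.027\t0.147"
), path)

# Fields per line: a header that is one field shorter than the data rows
# is a common reason for unexpected dimensions.
fields <- count.fields(path, sep = "\t")
print(fields)

# When the header is one field short, read.table() uses the first column
# as row names, so the result is samples x genes.
X <- as.matrix(read.table(path, header = TRUE, sep = "\t"))
print(dim(X))
```

Running count.fields() on the real file would show directly whether every line has the expected number of columns.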

Yes, 0 and 1 indicate binomial logistic regression.

With 300, the idea is to first use glmnet to reduce this to a lower number, and then use stepwise regression. You could also test each of the 300 genes separately and choose only the genes that have p<0.05. Take a look at my function, just released on Bioconductor: 3.1 Perform the most basic logistic regression analysis
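Testing each gene separately can be sketched with a plain loop over glm() in base R. This does not use the Bioconductor function mentioned above (it is not named in this thread); the data are simulated, the gene names are hypothetical, and the multiple-testing adjustment at the end is an optional extra of mine.

```r
# Simulated stand-in: 219 samples, 30 hypothetical genes, 0/1 outcome.
set.seed(1)
n <- 219
p <- 30
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("gene", 1:p)))
y <- rbinom(n, 1, plogis(1.2 * X[, 1] - 1.2 * X[, 2]))

# One logistic regression per gene; keep the p-value of the gene coefficient.
pvals <- apply(X, 2, function(g) {
  fit <- glm(y ~ g, family = binomial(link = "logit"))
  coef(summary(fit))["g", "Pr(>|z|)"]
})

# Genes passing the nominal threshold.
keep <- names(pvals)[pvals < 0.05]
print(keep)

# Optional (not in the thread): Benjamini-Hochberg adjustment, since many
# genes are tested at once.
padj <- p.adjust(pvals, method = "BH")
```

With 300 genes, about 15 would pass p<0.05 by chance alone, which is the reason for considering the adjusted values.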


@Kevin Blighe Sorry for the multiple questions, Kevin. I tried both read.table() and fread(), but neither can fully read my large data; they only read the first 16384 columns. My data is too large: Excel can show only 16384 columns instead of 25172. In MATLAB I used importdata, but I don't know which command in R can read the full data.

No, my original data has 25172 genes, and after using glmnet I get 300 genes.

I really want to try your code; it is clear and complete and can make my biological interpretation more significant.


Sorry, I didn't read all the comments, but fread should read it just fine; see this example with 30000 columns:

# make big file
x <- matrix(sample(1:10, 90000, replace = TRUE), ncol = 30000)
write.table(x, "bigMat.txt", row.names = FALSE)

dt <- data.table::fread("bigMat.txt")
dim(dt)
# [1]     3 30000
identical(dim(x), dim(dt))
# [1] TRUE
all(dt[, 1:10] == x[, 1:10])
# [1] TRUE

@zx8754 Thanks zx8754, but could you please explain these lines, such as sample(1:10, 90000, replace = TRUE)?

My data is 219*25172 and I'm not very familiar with R. Then I can try it.


I am just creating example data, similar to what you have, to show that fread can read files with 30000 columns.

Please share your example data, so we can reproduce your problem.


@zx8754 Hi, here are some rows and columns from my data:

         ELMO2         RPS11     PNMA1     MMP2
sample1  -0.73275      0.89175   -1.59775  -1.6905
sample2  -1.358083333  1.381625  0.24475   -0.837333333
sample3  -0.584        1.09      -1.027    0.147
sample4  0.689         0.952625  -2.223    0.150166667
sample5  -0.795083333  1.06425   -1.15475  -0.015166667
sample6  -1.241916667  1.753125  -1.7555   -1.2375

To clarify, please provide example data that fread fails to read.


@zx8754 What do you mean by example data? I shared some of the 219 rows and 25172 columns. I tried your code but it doesn't work.


If we can't reproduce your problem, it is hard to guess the problem and the solutions. With your supplied example data fread works fine, we need example data where fread fails.


"No, my original data has 25172 genes, and after using glmnet I get 300 genes"

So, extract these from your file and then input to R?


@Kevin Blighe But I want to run glmnet in R to extract genes from the 25172. I got those 300 genes using MATLAB.


Why do you want to repeat it in R?


Because I want to apply your next steps, glm and the Wald test, which require X and Y as inputs. I honestly don't know how to do those steps in MATLAB.

  1. Take the 300 genes identified by glmnet in MATLAB
  2. In BASH / Shell, extract these genes from your data (use AWK)
  3. Read the smaller dataset into R

Another option is to use R on a cluster, where the larger datatypes may be supported - these nuances are not my area of expertise.

Another option is to transpose the data in BASH / Shell, and then read the transposed data into R
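If the full file can at least be opened by fread, the filtering step could also be done from R with the select argument, which reads only the named columns. This is an R-side alternative to the AWK route, not what was suggested above. The file, the "sample" column name and the gene list below are all hypothetical, and the poster's read problem was never reproduced, so this may or may not get around it.

```r
library(data.table)

# Small stand-in file: a "sample" column plus 6 hypothetical gene columns.
set.seed(1)
path <- tempfile(fileext = ".txt")
full <- data.frame(sample = paste0("sample", 1:5),
                   matrix(rnorm(30), nrow = 5,
                          dimnames = list(NULL, paste0("gene", 1:6))))
fwrite(full, path, sep = "\t")

# Hypothetical list of genes retained by the elastic net.
selected <- c("gene2", "gene5")

# Read only the sample column and the selected genes.
sub <- fread(path, sep = "\t", select = c("sample", selected))
print(dim(sub))
```

With the real data, selected would hold the 300 gene names exported from MATLAB.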


@Kevin Blighe Great, Kevin! Thanks a lot! One more thing: I have been working on a small project concerning cancer evolution with CNA and clinical data over 4 pathological stages, and I'm planning to prepare a manuscript after finalizing the biological interpretations at each stage. I wonder if you would join us as a co-author, because the majority of the code for the CNA analysis came from your posts here. Good luck!


Sure thing. You can contact me from GitHub
