Multinomial elastic net implementation on a microarray dataset: error in lognet, and an explanation of nfolds
1
1
Entering edit mode
6.6 years ago
arronar ▴ 280

It's my first time trying to implement the elastic net algorithm on microarray data and I'm facing some issues. My dataset DATA is a data frame consisting of 72 observations (rows) and 18703 predictors (columns). The class labels are stored as the 18704th column of the data frame and come in replicates, like the following.

DATA["classes"] = c( rep("A",10),
                       rep("B",13),
                       rep("C",3),
                       rep("D",10),
                       rep("E",10),
                       rep("F",10),
                       rep("G",10),
                       rep("H",3),
                       rep("I",3))
The way I'm trying to execute the elastic net is the following:

# Random sampling: 70% for training, 30% for validation.
# Keep resampling until both sets contain all 9 classes.
y = z = 0
while (y != 9 || z != 9) {
  sample = sample(x = 1:nrow(DATA), size = floor(0.7 * nrow(DATA)))

  train = DATA[sample, ]
  test  = DATA[-sample, ]

  y = length(unique(train$classes))
  z = length(unique(test$classes))
}
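
As an aside, a stratified split avoids the retry loop by taking 70% of the rows of each class directly. A hedged sketch, assuming DATA and its classes column as above (the seed value is arbitrary):

# Stratified 70/30 split: sample 70% of the row indices within each class.
set.seed(42)
idx <- unlist(lapply(split(seq_len(nrow(DATA)), DATA$classes),
                     function(i) i[sample.int(length(i),
                                              max(1, floor(0.7 * length(i))))]))
train <- DATA[idx, ]
test  <- DATA[-idx, ]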

 # Execute the glmnet CV

 y = as.factor(train$classes)

 cvfit = cv.glmnet(train[,-18704], y, family="multinomial", alpha=0.9)

The problem now is that I'm getting the following error, and I can't work out what it actually means.

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  (list) object cannot be coerced to type 'double'
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground

Is this because the number of replicates per class is uneven (some classes have 10 and others 3), or is it something different? Also, while reading the documentation for cv.glmnet, I couldn't understand what the nfolds argument stands for. If anyone can explain this, it would be very helpful.

Thanks in advance.

R microarray elastic net multinomial regression • 6.6k views
ADD COMMENT
3
Entering edit mode
6.6 years ago

Hey arronar,

The nfolds parameter is key to the cv.glmnet function. If we take the default value of 10, i.e., 10-fold cross-validation, your dataset will be divided 'randomly' into 10 roughly equal chunks ('folds'). If you have 100 samples, each fold will contain 10 samples. During the cross-validation process, the model is fit 10 times: each time, 9 of the 10 folds are pooled and used as the training dataset, whilst the remaining fold is used as the validation dataset, so that every fold serves as the validation set exactly once. If you have consistent data and good model assumptions, the results from the training and validation folds should be highly similar. Obviously, the maximum value that you can have for nfolds is the total number of samples in your dataset. If you set nfolds to your total number of samples, then you're essentially doing what's called 'leave-one-out' cross validation.
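
To make that concrete, here is a minimal sketch (the x and y below are simulated, purely for illustration) of 10-fold versus leave-one-out cross-validation in cv.glmnet:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)   # 100 samples, 20 predictors
y <- rbinom(100, 1, 0.5)                   # a binary outcome, for simplicity

cv10  <- cv.glmnet(x, y, family = "binomial", nfolds = 10)   # 10-fold CV
cvloo <- cv.glmnet(x, y, family = "binomial", nfolds = 100,  # leave-one-out
                   grouped = FALSE)   # grouped=FALSE is needed for 1-sample folds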

If you arrive at consistent results at the end, you can have confidence that your model assumptions are correct. If you end up with highly variable results from cross validation, then something is wrong with your data and/or your model assumptions.

Cross-validation is different from bootstrapping, which involves taking a random sample of your data with replacement, performing a series of commands/transformations on it, and repeating this upward of 250x to compare how consistent your results are. Similar idea but different implementation.
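
If you want to see the difference, here's a minimal bootstrap sketch (plain base R, toy numbers only):

# Resample WITH replacement and recompute a statistic many times
# to see how stable it is.
set.seed(1)
x <- rnorm(50)                                         # toy data
boots <- replicate(250, mean(sample(x, replace = TRUE)))
quantile(boots, c(0.025, 0.975))                       # rough 95% interval for the mean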

Obviously alpha is important, too (see the sketch just after this list):

  • alpha = 1.0 gives the lasso penalty

  • alpha = 0.0 gives the ridge penalty

  • ...anything in between gives the elastic net
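
A quick sketch of the three settings side by side (again with simulated x and y, just to show the calls):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- rbinom(100, 1, 0.5)

fit_ridge   <- glmnet(x, y, family = "binomial", alpha = 0.0)  # ridge
fit_elastic <- glmnet(x, y, family = "binomial", alpha = 0.5)  # elastic net
fit_lasso   <- glmnet(x, y, family = "binomial", alpha = 1.0)  # lasso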

In terms of the problems running the cv.glmnet function, x has to be a data matrix, and y, as far as I recall, has to be integers (not a factor). So, apart from not converting your y into a factor, you could try:

cvfit = cv.glmnet(data.matrix(train[,-18704]), y, family="multinomial", alpha=0.9)

It also appears that you don't need to divide your dataset into training and validation datasets yourself, as the cross-validation does that split for you, once per fold.
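
To see the coercion issue end to end, here is a self-contained sketch (the data frame df and its class labels are made up for illustration):

library(glmnet)
set.seed(1)
df <- data.frame(matrix(rnorm(60 * 5), nrow = 60))   # 60 samples, 5 predictors
df$classes <- factor(rep(c("A", "B", "C"), each = 20))

# cv.glmnet(df[, -6], df$classes, family = "multinomial")        # fails:
#   a data frame is a list, hence "(list) object cannot be coerced"
fit <- cv.glmnet(data.matrix(df[, -6]), df$classes,
                 family = "multinomial", alpha = 0.9)             # works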

ADD COMMENT
0
Entering edit mode

Thank you very much for the explanation. Very informative and clear. I'm going to try it out.

Although I don't understand the following part of what you said:

If you set nfolds to your total number of samples, then you're essentially doing what's called 'leave-one-out' cross validation

If you create as many chunks as you have samples (e.g. 10), which sample is the one that is left out? I mean, you have 10 samples and you use all of them to train the algorithm. Why do we call it leave-one-out?

ADD REPLY
1
Entering edit mode

Hi arronar, when you choose nfolds as the total number of samples, which as I mentioned is 'leave-one-out' cross-validation, your dataset is 'randomly' divided into chunks, each chunk containing a single sample. If you have 100 samples, 99 of them are pooled and used as the training data, and just 1 sample is held out as the validation. This is then repeated 100 times, each time holding out a different sample as the validation.

Some people don't like this type of cross-validation precisely because only 1 sample is used as the validation. Most people do 3-fold.

Your next question may be about how the program 'randomly' chooses the samples for training and validation. R doesn't use truly random numbers; it uses a pseudo-random number generator (the Mersenne Twister, by default) to produce what I would call 'virtually random' numbers, which are then used to break up your dataset. The key point is that you can control how R breaks up the dataset by setting a fixed 'seed' value with the function set.seed(), e.g., set.seed(123456), prior to running cv.glmnet. Then, each time you run the function with the chosen seed, the dataset will be divided in the same way.
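
A tiny sketch of that reproducibility (simulated data again):

library(glmnet)
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- rbinom(100, 1, 0.5)

set.seed(123456)
fit1 <- cv.glmnet(x, y, family = "binomial")
set.seed(123456)
fit2 <- cv.glmnet(x, y, family = "binomial")
identical(fit1$lambda.min, fit2$lambda.min)  # TRUE: same seed, same folds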

ADD REPLY
1
Entering edit mode

Thank you once more. Your answers are very instructive.

ADD REPLY
1
Entering edit mode

Absolutely no problem. Any more questions, just ask :)

ADD REPLY
0
Entering edit mode

So, while I have you, let me ask one last thing about weights. Is it reasonable to assign weights to your samples if you don't know their relative importance beforehand? In my case, I cannot tell whether sample A is more important than sample H, so I cannot set specific weight values for them. Am I right, or am I missing something?

Also, even with the changes I've made, the algorithm doesn't seem to work properly.

I changed the classes into numbers and used nfolds = 3:

data.T["classes"] = c( rep(1,10),
                       rep(2,13),
                       rep(3,3),
                       rep(4,10),
                       rep(5,10),
                       rep(6,10),
                       rep(7,10),
                       rep(8,3),
                       rep(9,3))


# Elastic net implementation

library(glmnet)

y = as.numeric(data.T$classes)

cvfit = cv.glmnet(data.matrix(data.T[,-18704]), y, family="multinomial", nfolds = 3, alpha=0.9)

I tried many different nfolds values but I'm still getting the same error :S

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  one multinomial or binomial class has 1 or 0 observations; not allowed
In addition: Warning messages:
1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
2: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
ADD REPLY
1
Entering edit mode

Hey again arronar,

Just regarding the new error that you're getting: it at least allows us to infer that the data formats are okay, because they have passed the basic checks of the function. Also, I have read up on the behaviour of the function: the y variable is fine as either numeric or factor, whilst x does need to be converted to a data matrix.

The error one multinomial or binomial class has 1 or 0 observations; not allowed seems to be related to your factor level 13, of which there is only 1 sample. If you remove that factor level, then it may run. However, there are additional warning messages that relate to the fact that all of your factor levels have low numbers.

Is this just training data, like, for practice, or is it for a project that could be published? If you really do just have 9 samples, then it will be difficult to extrapolate meaningful stats from this. As an example, the last time that I ran a lasso regression, I had ~450 samples and ~500 variables. The lasso regression has also been used in breast cancer to identify mutation signatures, on datasets of >1000 samples and >100,000 mutations.

To answer your question on the weights: in glmnet, the weights argument actually relates to the observations (samples), not the variables; for example, you could up-weight samples from under-represented classes. If you have no prior reason to treat some samples as more important, it's fine to leave the weights at their default (all equal). Per-variable weighting is handled separately through the penalty.factor argument. As a practical example of that, if you were looking at gene expression in a specific tissue, you could penalise genes less according to how specifically each gene is expressed in the tissue of interest; some genes are expressed at very high levels in some tissues, and not at all in others.
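
A short sketch of both mechanisms (toy data; the class imbalance is contrived for illustration):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(90 * 10), nrow = 90)               # 90 samples, 10 predictors
y <- factor(rep(c("A", "B", "C"), times = c(60, 20, 10)))

# Observation weights: up-weight samples from the rarer classes.
w <- as.numeric(1 / table(y)[as.character(y)])
fit_w <- glmnet(x, y, family = "multinomial", weights = w)

# Per-variable penalties: a penalty factor of 0 means "never shrink this one".
pf <- c(0, rep(1, 9))                                # leave variable 1 unpenalised
fit_pf <- glmnet(x, y, family = "multinomial", penalty.factor = pf)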

Hope that this helps some more!

ADD REPLY
0
Entering edit mode

The error one multinomial or binomial class has 1 or 0 observations; not allowed seems to be related to your factor level 13, of which there is only 1 sample

Hmm, one of us has misunderstood something. I don't think I have a factor level 13 with only one sample. The y variable consists of the following:

factor level 1 * 10 samples and thus rep(1,10)
factor level 2 * 13 samples and thus rep(2,13)
factor level 3 * 3 samples and thus rep(3,3)
factor level 4 * 10 samples and thus rep(4,10)
factor level 5 * 10 samples and thus rep(5,10)
factor level 6 * 10 samples and thus rep(6,10)
factor level 7 * 10 samples and thus rep(7,10)
factor level 8 * 3 samples and thus rep(8,3)
factor level 9 * 3 samples and thus rep(9,3)

If you really do just have 9 samples, then it will be difficult to extrapolate meaningful stats from this

So, to answer you: my total number of samples, with their replicates, is 62, not 9. Also, as you can see, there is no factor level that has only one sample.

I'm just playing with these data for now. I don't know what the future will bring.

I've grown tired of just thanking you for your informative answers, so now let me treat you to a beer :-)

ADD REPLY
1
Entering edit mode

Apologies, I misinterpreted your code regarding the y variable! However, is it 62 samples or 72? Your y vector contains 72 elements. Also, the factor levels with just 3 samples are still going to throw (produce) the warning message - you may have to remove them in order to obtain reliable results from the cross-validation.

Finally, the error message may relate to your variables being too 'flat' (lacking variation) in some of the factor levels, e.g., in the factor levels with just 3 samples. In your data matrix, are there many missing or zero values? Are there any variables whose values are the same across all samples (or even just across all samples within a factor level)?

If you run the hist() function on your data-matrix, does the distribution look normal?
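
A small diagnostic sketch along those lines (assuming data.T as defined earlier in the thread):

xmat <- data.matrix(data.T[, -18704])          # predictors only

sum(is.na(xmat))                               # how many missing values?
flat <- apply(xmat, 2, function(v) var(v, na.rm = TRUE) == 0)
sum(flat, na.rm = TRUE)                        # how many zero-variance columns?
hist(xmat)                                     # overall distribution of the values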

Sure you can send the beer to Ireland!

ADD REPLY
1
Entering edit mode

Hey!

You were right. After deleting the samples with the small replicate numbers, the algorithm ran as expected. I also tried to create a matrix of the most significant variables from the sparse matrices returned by the coef() function. Take a look and tell me your opinion.

# Elastic net implementation

library(glmnet)

# Remove the samples with only three replicates (Rem3, Das, DasRem)
data.T = data.T[!(row.names(data.T) %in% c(31,32,33,81,82,83,91,92,93)), ]

set.seed(123)
y = as.numeric(data.T$classes)

cvfit = cv.glmnet(data.matrix(data.T[,-18704]), y, family="multinomial", nfolds = 9, alpha=0.9)
plot(cvfit)

# Extract the most important variables (non-zero coefficients at lambda.min).
myCoefficients = coef(cvfit, s=cvfit$lambda.min)

genes = coefs = c()
for (l in myCoefficients) {
  genes = c(genes, dimnames(l)[[1]][which(l != 0)])
  coefs = c(coefs, l[which(l != 0)])
}
coefm = matrix(c(genes, coefs), nrow = length(genes))  # two columns: gene, coefficient
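
For what it's worth, a slightly more explicit variant of the same extraction (assuming cvfit as above), keeping track of which class each coefficient belongs to:

# coef() on a multinomial cv.glmnet fit returns one sparse matrix per class.
coef_list <- coef(cvfit, s = cvfit$lambda.min)
res <- do.call(rbind, lapply(names(coef_list), function(cl) {
  m    <- coef_list[[cl]]
  keep <- which(m != 0)
  if (length(keep) == 0) return(NULL)      # class with nothing selected
  data.frame(class = cl, gene = rownames(m)[keep], coef = m[keep])
}))
head(res)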
ADD REPLY
1
Entering edit mode

Fantastic - really happy to have helped out.

ADD REPLY
0
Entering edit mode

The procedure for extracting the significant genes is right, isn't it?

I'm happy too :-) At eleven tomorrow morning, a DHL package of beers will arrive at your office :-)

ADD REPLY
3
Entering edit mode

Yes, I believe so: you have to look for the variables whose coefficients have not been 'shrunk' to 0.

After I ran my elastic net model, I executed the following code:

idealLambda <- cv.ElasticModel$lambda.min
idealLambda1se <- cv.ElasticModel$lambda.1se
print(idealLambda); print(idealLambda1se)
co <- coef(cv.ElasticModel, s=idealLambda, exact=TRUE)
#co <- coef(cv.ElasticModel, s=idealLambda1se, exact=TRUE)
co[order(co, decreasing=TRUE),]
co[order(co, decreasing=FALSE),]

I then look at the coefficients and pick the best variable predictors. After that, I build a GLM with the best predictors:

finalElastic <- glm(CaseControl ~ VAR1 + VAR2 + VAR3 + VAR4 + VAR5 + VAR6 + VAR7, data=MyDF, family=binomial(link="logit"))
summary(finalElastic)

I then perform a Wald test to get a single P value for the model:

library(aod)
wald.test(b=coef(finalElastic), Sigma=vcov(finalElastic), Terms=c(2:8))

...and then I produce a ROC curve:

pdf("ROC.pdf")
    library(pROC)

    roc <- roc(na.omit(MyDF$CaseControl), fitted(finalElastic), smooth=FALSE)
    plot.roc(roc, add=FALSE, grid=TRUE, grid.lwd=2, col="black", main="My Elastic model")
    text(0.3, 0.15, col="black", paste("Elastic net-penalized regression:\nAUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.7)
dev.off()

Great! I will be expecting some lovely Austrian beer tomorrow!

ADD REPLY
1
Entering edit mode

Thank you very very much!

Have a nice rest of the day!

ADD REPLY
