Question: How to evaluate a biomarker signature in an independent dataset
11 months ago by
JJ440 wrote:

Hi all,

I have RNA-seq samples from two groups (responders / non-responders). I am interested in generating a predictive gene signature which can separate the two groups. Based on a previous post, I have now decided to use lasso-penalized regression or elastic net regression.

So, now I am looking to evaluate this signature.

  • First, I can do this with a training and test set.
  • Second, I would like to test it in independently generated datasets: RNA-seq datasets, but also qPCR.

My question now is: how do I do this? The first one is straightforward. Just split the data (80% for building a predictive model, 20% for evaluating the model) and then make predictions on the test data. But how can I do this for an independently generated dataset? I assume I cannot directly use the final model on the independent datasets.

Thank you for your help/input!

11 months ago by
Kevin Blighe46k wrote:

You can certainly use the same model on the new data and make predictions on it - this is where the real testing of the work comes into play. It just requires the same variable names (here, gene names) and obviously your new data should be on the same scale and processed in the same way. I've done this for predicting ethnicity using SNPs and it is surprisingly 'good', in terms of sensitivity / specificity and ROC analysis.
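As a rough sketch of what applying a fixed model to new data involves, assuming the final model is a penalised logistic regression: the gene names, coefficient values, and the cohort below are all hypothetical, and the new samples are assumed to be already processed and scaled in the same way as the training data.

```python
import math

# Hypothetical signature: intercept and per-gene coefficients as they might come
# out of a penalised logistic regression. Values are illustrative only.
INTERCEPT = -0.2
COEFS = {"GENE_A": 1.1, "GENE_B": -0.8, "GENE_C": 0.5}

def predict_proba(sample, coefs=COEFS, intercept=INTERCEPT):
    """Return P(responder) for one sample given as {gene_name: scaled expression}."""
    # The new data must contain every gene in the signature, under the same name.
    missing = sorted(set(coefs) - set(sample))
    if missing:
        raise KeyError(f"new data lacks signature genes: {missing}")
    linear = intercept + sum(w * sample[g] for g, w in coefs.items())
    return 1.0 / (1.0 + math.exp(-linear))

def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """y_true uses 1 = responder, 0 = non-responder; needs both classes present."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p >= threshold)
    fn = sum(1 for t, p in pairs if t == 1 and p < threshold)
    tn = sum(1 for t, p in pairs if t == 0 and p < threshold)
    fp = sum(1 for t, p in pairs if t == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up independent cohort: (expression values, true label). Genes outside
# the signature (here GENE_X) are simply ignored.
new_cohort = [
    ({"GENE_A": 1.5, "GENE_B": -1.0, "GENE_C": 0.3, "GENE_X": 2.0}, 1),
    ({"GENE_A": -1.2, "GENE_B": 0.9, "GENE_C": -0.4, "GENE_X": 0.1}, 0),
    ({"GENE_A": 0.8, "GENE_B": 0.2, "GENE_C": 1.0, "GENE_X": -0.5}, 1),
    ({"GENE_A": -0.3, "GENE_B": 1.4, "GENE_C": 0.0, "GENE_X": 1.1}, 0),
]
probs = [predict_proba(sample) for sample, _ in new_cohort]
truth = [label for _, label in new_cohort]
sens, spec = sensitivity_specificity(truth, probs)
```

The model itself is never refitted here; only the frozen coefficients touch the new samples, which is what makes the independent dataset a genuine test.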

My experience of using lasso-penalised regression is that it's not that great for identifying a definitive model. It can certainly help to reduce a large variable load to a more manageable number, like 50-100. One can then apply stepwise regression on the reduced dataset and further test a few final models for things like R² shrinkage and through ROC analysis.

Note that lasso-penalised, elastic-net, and ridge regression merely differ based on the value of alpha:

The elastic-net penalty is controlled by \(\alpha\), and bridges the gap between lasso (\(\alpha=1\), the default) and ridge (\(\alpha=0\)). The tuning parameter \(\lambda\) controls the overall strength of the penalty.
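To make the role of the two parameters concrete, here is a small sketch of the penalty term itself, assuming the usual glmnet-style parameterisation \(\lambda\,[\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2]\). The function name and the example coefficients are made up for illustration.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Penalty added to the loss for coefficient vector beta (intercept excluded).

    alpha = 1 gives the pure lasso (L1) penalty, alpha = 0 the pure ridge (L2)
    penalty, and values in between mix the two. lam scales the whole term.
    """
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

beta = [1.0, -2.0]  # illustrative coefficients
lasso_term = elastic_net_penalty(beta, alpha=1.0, lam=0.5)
ridge_term = elastic_net_penalty(beta, alpha=0.0, lam=0.5)
mixed_term = elastic_net_penalty(beta, alpha=0.5, lam=0.5)
```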



I've generated some Powerpoint notes on model testing on new data on my GitHub page:



Thank you so much for your answer!

Does this still work well when you have different data types as well? For example, a model built on RNA-seq and applied to qPCR?

Thank you for your input on how to perform the regression. I will now use penalised regression, trying out different alphas, and then apply stepwise regression if too many variables still remain. I have decided on 15 samples per group for the discovery/training set and 5 per group for the validation/test set. Thanks again!
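A minimal sketch of such a split, holding out 5 samples per group so that both classes are represented in the test set. The sample IDs, the group labels, and the assumption of 20 samples per group (15 + 5) are hypothetical.

```python
import random

def stratified_split(labels, n_test_per_group, seed=0):
    """Split sample IDs into train/test, holding out a fixed number per group.

    labels maps sample ID -> group name. A seeded RNG keeps the split reproducible.
    """
    rng = random.Random(seed)
    by_group = {}
    for sample_id, group in labels.items():
        by_group.setdefault(group, []).append(sample_id)
    train, test = [], []
    for group in sorted(by_group):
        ids = sorted(by_group[group])  # sort first so the shuffle is deterministic
        rng.shuffle(ids)
        test.extend(ids[:n_test_per_group])
        train.extend(ids[n_test_per_group:])
    return train, test

# Hypothetical cohort: 20 responders and 20 non-responders.
labels = {f"R{i}": "responder" for i in range(20)}
labels.update({f"N{i}": "non_responder" for i in range(20)})
train_ids, test_ids = stratified_split(labels, n_test_per_group=5)
```

Any feature selection or choice of alpha and lambda would then use only `train_ids`; the held-out samples are touched once, for the final evaluation.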

written 11 months ago by JJ440

A universal tenet of making predictions is that the degree to which they can be trusted is dependent upon how similar the underlying data is to that used to train/fit the model. Using a model fit on one data type to make predictions on a significantly different data type is going to lead to a world of headaches.

written 11 months ago by Devon Ryan91k

Thanks for the input. But applying a model built on RNA-seq data to an independent RNA-seq dataset is generally accepted?

Is there anything you could suggest on how to translate such findings between data types?

written 11 months ago by JJ440

Yes, that's acceptable since the model was built on similar data. If you start changing library protocols and such then the results will get less reliable, of course. I've never tried running models fit on RNAseq to qPCR data, so I don't know off-hand exactly what transformations would be best. Perhaps Kevin has done that, but I suspect you'll have to find some matched datasets and play around with the data to see what's reasonable.

written 11 months ago by Devon Ryan91k

Yes, as alluded to by Devon, performing the RNA-seq model predictions on qPCR data may not be valid. The general process would be this:

  1. Build model predictor from RNA-seq training data
  2. Perform model predictions on both the training and testing data from the same RNA-seq experiment
  3. Perform model predictions on independent RNA-seq experiments processed in the same way (optional)
  4. Put your final panel of genes to the test by independently re-performing differential analysis / model building, but, this time, using a targeted method, such as high-throughput qPCR, NanoString, etc., and usually on a higher number of samples.
  5. Further refine your model based on #4

In the past, what we did was take genes from RNA-seq that were differentially expressed and then test these on NanoString data. We then only performed model building on the NanoString data itself. We also did the same for RNA-seq and Fluidigm data. There is no real definitive way to do this, though.

modified 11 months ago • written 11 months ago by Kevin Blighe46k

Thank you so much for your input!

written 11 months ago by JJ440