Question: Is it fine to run SVM on RNA-seq read counts?
gravatar for fernardo
2.0 years ago by
fernardo 130
fernardo 130 wrote:


I have 930 samples of RNA-seq for 4 conditions. I am using RSEM processed for the RNA-seq data. Now I am doing the following steps:

1- Picking 75 genes of interest for the 930 samples.

2- Importing such table into SVM to classify those 4 conditions based on those genes. (NOTE: 75% training set, 25% test set)

3- Result: 100% true position. NOTE: even if I decrease the number of features (genes) from 75 to 25, it gives the same result.

Does any know this problem? can SVM be used for multiple classifications on such data?

Data content (gene expression) starts from 0 to 12000 or even more.

If code required, let me know.


#NOTE: Here is my dataset structure and also please note the SC row which a number assigned to each group of samples. 

      Sample1 Sample2 Sample3  ....

GeneA    234    2324  811   4    23    0
GeneB    .     .     .     .    .     .    
SC        1     1      1    2     2     2

x <- data.frame(t(data_set))
intrain <- createDataPartition(y = x$SC, p= 0.7, list = FALSE) 
training <- x[intrain,]
testing <- x[-intrain,]
training[["SC"]] = factor(training[["SC"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_Linear <- train(SC ~., data = training, method = "svmLinear",trControl=trctrl,
                preProcess = c("center", "scale"),tuneLength = 10)
test_prediction <- predict(svm_Linear, newdata = testing)
confusionMatrix(test_prediction, testing$SC)


Edit: "RMA" -> "RSEM"

Thanks for any help

ADD COMMENTlink modified 22 months ago • written 2.0 years ago by fernardo 130

If code required, let me know.

Yes code and minimal dataset is required to help you ;)

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Nicolas Rosewick9.3k

Thanks. Plz see the Update :)

ADD REPLYlink written 23 months ago by fernardo 130

If I interpret correctly you are implying you are get 100% accuracy on the test data. As such there is no problem in running SVM on RNA-seq read counts but the results you seem to get are not believable. It is hard to comment without looking at your code snippet.

As such it seems, there is some overfitting is happening and results of the training data itself are being provided. Further it is advisable to normalise the raw counts using VST or some other log based transformation.

ADD REPLYlink written 2.0 years ago by noorpratap.singh300

How did you normalize the RNASeq data? What is is your 5x cross validation results?

ADD REPLYlink written 2.0 years ago by kristoffer.vittingseerup3.5k

@kristoffer, the data were normalized using RMA algorithm.

ADD REPLYlink written 23 months ago by fernardo 130

Please see the update and let me know what you think about the code :)

ADD REPLYlink written 23 months ago by fernardo 130
gravatar for Charles Warden
2.0 years ago by
Charles Warden8.0k
Duarte, CA
Charles Warden8.0k wrote:

If you pick the genes/features using your full set of samples, you won't independent training / validation datasets (which I would argue is why you would test predictability with a machine learning method, versus a statistical test in a smaller set of samples).

If you are using another dataset for your validation, that might be OK. I would typically use some sort of normalized expression (such as Count-Per-Million or Read-Per-Kilobase-per-Million), but your features have be changed with a sufficiently strong difference to be more clear than other factors (such as the library prepration method, unrelated biological differences in the samples, etc.). In other words, you have to be picking of differences that are greater than your typical variable due to confounding factors.

Even though it is kind of something that I had to have greater appreciation for after publication, I think things like Leave-One-Out-Cross-Validation are actually not that great because you either i) violate the independence of the validation with upstream feature selection and/or ii) you define a different model for each sample (meaning you don't have one model that you can test in new samples).

So, my recommendation would be either i) split your data into thirds, 1/3 training and two separate 1/3 validation datasets or ii) perform analysis similar to what you have described and use another large cohort for validation.

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Charles Warden8.0k

Thanks for your input. Yes I am using the same dataset for both training and test data and theoretically, I believe it is fine as when we need to find a classifier, it always expects the same pattern of data to do the right classification based on the features, am I right or not?

Plus, please have a look at the Updated part that I added some code and you may have comment on that. Thanks.

ADD REPLYlink written 23 months ago by fernardo 130

Sorry that I didn't have a chance to look at this until later in the day.

With a quick look, I think that is OK (since it looks like you have one training dataset and one validation dataset).

It's been a little while since I worked with SVMs in R (with the e1071 library), but you may also want to mention your dependencies (if I remember correctly, it looks like you are using a different package for your prediction).

ADD REPLYlink written 23 months ago by Charles Warden8.0k

Also, the other thing that could sometimes be an issue is if you introduce bias during upstream normalization.

Strictly speaking, the RMA normalization is based upon your set of samples (and wouldn't be perfectly representative of being able to make an assignment in a new sample, independently of any other samples). However, I don't think that should be as much of an issue as if you did something like apply ComBat normalization and/or impute missing values (most problematically, in your validation samples) before separating training and validation datasets (I think that is more likely to violate the independence of those datasets, and might introduce some over-fitting that makes the clustering of your samples look more similar than they really would be in a new cohort).

ADD REPLYlink modified 23 months ago • written 23 months ago by Charles Warden8.0k

There is also yet another way that you can have a problem, which I would guess is probably happens relatively often (if you don't think carefully about what you are doing). I already alluded to this when offering a critique of LOOCV, but I thought I should say something in terms of your possible code.

I would expect feature selection to occur upstream of the SVM step. If you used something like differential expression for feature extraction upstream of the code you present, then your validation accuracy will be over-fit (since you used your test samples to reduce the number of features considered for your model).

I don't know if this is actually an issue (or whether you provided the full set of data, without your own feature selection step), but it is a possibility based upon what I can see.

ADD REPLYlink written 23 months ago by Charles Warden8.0k

Thanks Charles. The features are not differentially expressed genes. The genes were picked up randomly.

ADD REPLYlink written 23 months ago by fernardo 130

Ok - I see now that you had 75 genes.

Actually, the spacing of the messages added to some confusion, that I didn't notice until re-reading the 1st question. This can definitely be a problem in other situations as well, but I apologize about that.

RMA normalization is for microarray data. I wouldn't expect RMA normalization to be applied to RNA-Seq data. So, with 75 genes, I thought this might have been qPCR or nCounter data. However, if it is RNA-Seq, then I was about to assume it had to be a targeted gene panel (like AmpliSeq). However, the part about RMA expression for RNA-Seq also doesn't make sense.

Are you talking about RNA-Seq data?

If you use different data types, it is also important you don't reduce features with one data type, and then use samples from the same patient (either identical or recurrent). That would also violate the independence of the validation data set.

ADD REPLYlink modified 23 months ago • written 23 months ago by Charles Warden8.0k

Thanks again. Actually, you misunderstood and sorry if I was not clear enough. I am going to try now. I have RNA-seq data (~20,000 genes) that already normalized by RMA, is common and fine. Among those number of genes, I pick up a subset and do this SVM analysis which then comes out with high accuracy as I already mentioned. Perhaps I am going to post a graph or so that shows the accuracy too.

ADD REPLYlink written 23 months ago by fernardo 130

1) One problem is that RMA normalization is for microarrays, not RNA-Seq.

2) You need to explain how you picked that subset of genes. Picking those genes prior to splitting up the training and validation sets can potentially cause problems with the independence of the validation dataset.

ADD REPLYlink written 23 months ago by Charles Warden8.0k

1) Sorry, I guess I misunderstood until now. I was confused actually, the normalization was RSEM, not RMA.

2) Here is two answer for that:

  • Did you mean picking up genes prior to or after? Because in general, I understood that the features must be picked then running training and testing later. Sorry I didn't understand you what you mean exactly because I can't see it is better to pick the genes after splitting the data?
  • I even came up with a strange result that when I do random gene selection, still the accuracy of the classification is too high. That means it is not just for those subsets of genes.
ADD REPLYlink modified 22 months ago • written 22 months ago by fernardo 130

1) Ok - that is the sort of confusion that I want to avoid in the future, but I could see how that may be an issue, particularly if you didn't perform all of the analysis yourself. If you have RNA-Seq data, you can also use unique counts (just in case there was something about how the multi-mapped-reads being assigned to transcripts was not accurate).

2) I guess having a feature selection step does imply some limits for the training step. However, I could see how it might help avoid over-fitting (if you limit the maximal amount of information provided to a model). The bottom line is that if you use all of the data for feature extraction, that would cause your predicted accuracy to be too high (since you picked genes that you already knew varied with the expected trend in all samples, including your validation samples).

Otherwise, if you made certain that there wasn't anything that violated the independence of your test dataset (which it admittedly hard for me to do, with only a subset of the information, and without having done the analysis myself), there is a real possibility that there is no benefit (or even negative benefit, if you encountered over-fitting) to including extra genes for some applications.

There are probably better examples for predictive models, but this paper show a survival difference that is more clear for one gene (IL13Ra2) rather than a set of genes used to define subtypes:

ADD REPLYlink written 22 months ago by Charles Warden8.0k

I did some more test and I like to keep the story simple. So without feature selection, I picked up set of features quite randomly and still gives high accuracy. That is my main issue. I don't use any method to do feature selection for the samples, I pick features randomly without any properties and still gives high accuracy especially if the number of features increases.

ADD REPLYlink written 22 months ago by fernardo 130

Something seems strange about that result, but it is hard for me to say what.

Also, in the interests of fairness, I should probably mention that the method for survival plot is different for IL13Ra2 (where we estimate the percent over-expressed per-cohort) versus the MES/PN stratification (based upon median average expression per-cohort).

Nevertheless, there can definitely be situations where adding unnecessary complications can decrease the robustness of the results. For example, I don't know the exact cause of the issue with the MSigDB signature, but you can a decent example of what over-fitting looks like in this plot from Warden et al. 2013:

Robustness of PR signature

To be fair, the background for random assignment won't always be 50% (if something happens most of the time, or doesn't happen most of the time, always picking the more common status should probably be your model for comparison for rare events). Nevertheless, I think the plot is still a useful example of what you would expect to happen if you over-fit a model with your own dataset.

ADD REPLYlink modified 22 months ago • written 22 months ago by Charles Warden8.0k

Here is one outcome of SVM on the data.

ADD REPLYlink modified 22 months ago • written 22 months ago by fernardo 130

Aside from needing to test the model on independent datasets, I'm not sure what to suggest.

To some extent, you can see the issue with uneven categories (if you just assumed everything was category #2, you'd be right 90% of the time), but I don't think that is the main issue here (since 16/17 are predicted to have status #1).

However, being able to pick any subset of genes and get that result still seems strange. Perhaps there is some sort of bias in the normalization? For example, is there something like the total number of aligned reads that is noticeably different in those 16 samples? If there is something different in those 16 samples that affects all genes during the normalization process, perhaps that can be a factor? That would still seem strange for completely random gene sets, but that is the main thing I can think of (since your code indicates that the training samples were separated before training the model).

ADD REPLYlink written 22 months ago by Charles Warden8.0k

Here is also a simple plot if you can get something out of it. Thanks.

ADD REPLYlink modified 22 months ago • written 23 months ago by fernardo 130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1000 users visited in the last hour