Question: Is it fine to run SVM on RNA-seq read counts?
fernardo wrote:

Hi,

I have 930 samples of RNA-seq for 4 conditions. I am using RMA-processed values for the RNA-seq data. Now I am doing the following steps:

1- Picking 75 genes of interest for the 930 samples.

2- Importing this table into an SVM to classify the 4 conditions based on those genes. (NOTE: 75% training set, 25% test set)

3- Result: 100% true positives. NOTE: even if I decrease the number of features (genes) from 75 to 25, it gives the same result.

Has anyone seen this problem? Can an SVM be used for multi-class classification on such data?

The data values (gene expression) range from 0 up to 12000 or even more.

If code is required, let me know.
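In the meantime, a rough sketch of the workflow above looks like this (assuming scikit-learn; the file name and column layout are made up for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # hypothetical file: 930 samples (rows) x 75 selected genes + a "condition" label with 4 classes
    df = pd.read_csv("expression_75genes.csv")
    X = df.drop(columns=["condition"]).values   # gene expression values
    y = df["condition"].values                  # labels for the 4 conditions

    # 75% training / 25% test split, then a multi-class SVM (SVC handles >2 classes via one-vs-one)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    clf = SVC(kernel="rbf")
    clf.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))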

Thanks for any help


"If code required, let me know."

Yes, code and a minimal dataset are required to help you ;)

written 4 days ago by Nicolas Rosewick

If I interpret correctly, you are saying that you get 100% accuracy on the test data. There is no problem as such with running an SVM on RNA-seq read counts, but the results you seem to get are not believable. It is hard to comment without looking at your code snippet.

It seems that some overfitting is happening, or that results on the training data itself are being reported. Further, it is advisable to normalise the raw counts using VST or some other log-based transformation.
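For example, a simple log-based alternative could look like this (a minimal sketch with made-up counts; VST itself would come from DESeq2 in R):

    import numpy as np

    # counts: samples x genes matrix of raw read counts (made-up numbers)
    counts = np.array([[1500,    0,  320],
                       [ 900,   12, 4100]], dtype=float)

    # library-size normalisation to counts-per-million, then log2 with a pseudocount
    lib_sizes = counts.sum(axis=1, keepdims=True)
    log_cpm = np.log2(counts / lib_sizes * 1e6 + 1.0)
    print(log_cpm)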

written 4 days ago by noorpratap.singh

How did you normalize the RNA-seq data? What are your 5x cross-validation results?
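For example, a quick 5-fold estimate could be obtained along these lines (a sketch using scikit-learn, with made-up stand-in data in place of the real matrix):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    # made-up stand-ins for the real data: 930 samples x 75 genes, 4 condition labels
    rng = np.random.default_rng(0)
    X = rng.poisson(50.0, size=(930, 75)).astype(float)
    y = rng.integers(0, 4, size=930)

    # stratified 5-fold cross-validation of the same SVM
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)
    print("per-fold accuracy:", scores, "mean:", scores.mean())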

written 4 days ago by kristoffer.vittingseerup
Charles Warden wrote:

If you pick the genes/features using your full set of samples, you won't have independent training / validation datasets (which, I would argue, is why you would test predictability with a machine learning method rather than with a statistical test in a smaller set of samples).

If you are using another dataset for your validation, that might be OK. I would typically use some sort of normalized expression (such as Counts Per Million or Reads Per Kilobase per Million), but your features have to change with a sufficiently strong difference to be clearer than other factors (such as the library preparation method, unrelated biological differences between the samples, etc.). In other words, you have to be picking up differences that are greater than the typical variation due to confounding factors.
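For instance, CPM/RPKM-style normalization can be computed roughly like this (a sketch; the counts and gene lengths are made-up stand-ins):

    import numpy as np

    # made-up stand-ins: raw counts (samples x genes) and gene lengths in base pairs
    counts = np.array([[120, 3000,  45],
                       [200, 2500,  80]], dtype=float)
    gene_lengths_bp = np.array([2000, 15000, 800], dtype=float)

    lib_sizes = counts.sum(axis=1, keepdims=True)   # total reads per sample
    cpm  = counts / lib_sizes * 1e6                 # Counts Per Million
    rpkm = cpm / (gene_lengths_bp / 1e3)            # ...per kilobase of gene length
    print(cpm)
    print(rpkm)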

Even though it is something I only came to appreciate more after publication, I think things like Leave-One-Out Cross-Validation are actually not that great, because you either i) violate the independence of the validation with upstream feature selection and/or ii) define a different model for each sample (meaning you don't have one model that you can test on new samples).

So, my recommendation would be to either i) split your data into thirds, with 1/3 for training and two separate 1/3 validation datasets, or ii) perform an analysis similar to what you have described and use another large cohort for validation.
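As a concrete sketch of option i) (assuming scikit-learn; the data are random stand-ins, and the univariate F-test selection is just one illustrative way to pick features inside the training third only):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC

    # made-up stand-ins: 930 samples x 2000 genes of normalized expression, 4 classes
    rng = np.random.default_rng(0)
    X = rng.normal(size=(930, 2000))
    y = rng.integers(0, 4, size=930)

    # 1/3 for training, then split the remaining 2/3 into two separate validation sets
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=2/3, stratify=y, random_state=0)
    X_val1, X_val2, y_val1, y_val2 = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

    # feature selection is fitted on the training third only, keeping the validation sets independent
    model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=75), SVC(kernel="rbf"))
    model.fit(X_train, y_train)
    print("validation set 1 accuracy:", model.score(X_val1, y_val1))
    print("validation set 2 accuracy:", model.score(X_val2, y_val2))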
