Machine learning classifiers on TCGA gene expression data (tpm_unstrand)
21 months ago
Jakpa ▴ 50

Hi,

I have a bladder cancer dataset downloaded from the TCGA database. I am training an ML classifier to predict the bladder cancer class, using the clinical data as the target and the (tpm_unstrand) gene expression data as features.

downloaded dataset:

query_TCGA = GDCquery(
  project = "TCGA-BLCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  barcode = c("TCGA-*"))

The data:

     data_Bca.shape

    (428, 4933)

(image: data sample)

Target: BlcaGrade

Preprocessing:

remove low variance columns

remove columns with similar values

remove highly correlated columns

using mutual information to remove columns with no information

After preprocessing I was left with 1000 features
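The post does not say whether these filters were fit on all samples or only on the training split. If a label-aware filter such as mutual information sees the test samples, the test score is no longer an honest estimate. A minimal sketch of one way to avoid that, assuming scikit-learn and using synthetic data as a stand-in for the expression matrix (`X_demo`, `y_demo`, `k=50` and the step names are illustrative, not taken from my actual preprocessing):

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the expression matrix (NOT the TCGA data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=300, n_informative=10, random_state=0
)

# The filters are pipeline steps, so every CV fold refits them
# on its own training part and never sees the held-out part.
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("mutual_info", SelectKBest(partial(mutual_info_classif, random_state=0), k=50)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print("mean CV accuracy:", cv_scores.mean())
```

The correlation filter is left out of the sketch because scikit-learn has no built-in transformer for it; it would need a small custom step.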

ML Modeling Random Forest

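from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stratified 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)

# Fit a random forest with default hyperparameters
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Evaluate on the held-out test set
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))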


>                   precision    recall  f1-score   support
>     
>                0       0.53      0.56      0.55        41
>                1       0.56      0.53      0.55        43
>     
>         accuracy                           0.55        84
>        macro avg       0.55      0.55      0.55        84
>     weighted avg       0.55      0.55      0.55        84
> 
> 0.5476
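To put the 0.5476 in context, it can be compared with always predicting the larger class. A small worked calculation using only the numbers in the report above (the binomial standard error is a rough approximation, not a formal test):

```python
import math

support = {0: 41, 1: 43}   # per-class test counts from the report above
accuracy = 0.5476          # reported test accuracy

n_test = sum(support.values())
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(support.values()) / n_test
# Approximate binomial standard error of the reported accuracy.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"accuracy: {accuracy:.3f} +/- {std_err:.3f}")
```

The baseline comes out at about 0.51 and the standard error at about 0.05, so the reported accuracy is within one standard error of the majority-class baseline.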

I also checked the accuracy on the training set:

rfc.score(X_train, y_train)

1.0
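A random forest with fully grown trees typically scores close to 1.0 on its own training data, so the training score alone says little. A sketch of looking at the train/validation gap across folds instead, assuming scikit-learn's `cross_validate` and synthetic stand-in data (`X_demo`, `y_demo` and `rf_demo` are illustrative names, not my real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (NOT the TCGA data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=100, n_informative=5, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# return_train_score=True reports the score on each fold's training part
# alongside the score on its held-out part.
results = cross_validate(rf_demo, X_demo, y_demo, cv=5, return_train_score=True)

print("train accuracy per fold:     ", results["train_score"])
print("validation accuracy per fold:", results["test_score"])
```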

I can see that my model is overfitting, so I then applied GridSearchCV.

param_dict = dict(n_estimators = estimators,
                  max_depth = max_depth,
                  min_samples_split = samples_split,
                  # min_samples_leaf = samples_leaf,
                  criterion = criterion)

gv = GridSearchCV(rfc, 
                  param_dict,
                  cv = 3, 
                  verbose = 1, 
                  n_jobs = -1)


best_params = gv.fit(X_train, y_train)

best_params.best_params_

{'criterion': 'entropy',
 'max_depth': 5,
 'min_samples_split': 2,
 'n_estimators': 400}

I got an accuracy of 0.53 on the test set.
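For reference, a sketch of how the cross-validated score of the selected setting can be compared with the held-out score, assuming scikit-learn's default `refit=True`. The data is synthetic and `demo_grid` is a hypothetical grid, since the values I actually searched are not shown above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the TCGA data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Hypothetical grid, for illustration only.
demo_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

search = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid, cv=3)
search.fit(X_tr, y_tr)

# best_score_ is the mean CV accuracy of the chosen setting on the training split;
# best_estimator_ is that setting refit on the whole training split.
print("best params:      ", search.best_params_)
print("mean CV accuracy: ", search.best_score_)
print("held-out accuracy:", search.best_estimator_.score(X_te, y_te))
```

If the cross-validated accuracy is also close to chance, tuning the forest is unlikely to be the bottleneck.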

I have tried different hyperparameter settings, but the model is still overfitting.

Am I using the wrong data, or is it not possible to do machine learning on gene expression datasets?

Any suggestions on what to do differently?

machine RNASeq python learning • 560 views