Question

Machine learning classifiers on TCGA gene expression data (tpm_unstrand)

21 months ago

Jakpa ▴ 50

Hi,

I have a bladder cancer dataset downloaded from the TCGA database. I am training an ML classifier to predict the bladder cancer class, using a clinical variable as the target and the gene expression data (tpm_unstrand) as features.

Downloaded dataset:

query_TCGA = GDCquery(
  project = "TCGA-BLCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  barcode = c("TCGA-*"))

The data:

     data_Bca.shape

    (428, 4933)

datasample

Target: BlcaGrade

Preprocessing:

remove low variance columns

remove columns with similar values

remove highly correlated columns

using mutual information to remove columns with no information

After preprocessing I was left with 1000 features
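
For reference, the four filtering steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the exact code I ran: the function name `filter_features`, the thresholds, and the toy data are hypothetical, and "similar values" is interpreted here as exactly duplicated columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif


def filter_features(X, y, var_threshold=0.01, corr_threshold=0.9, random_state=0):
    """Apply the four filters in order. All thresholds are illustrative assumptions."""
    # 1. Remove low-variance columns
    vt = VarianceThreshold(threshold=var_threshold)
    vt.fit(X)
    X = X.loc[:, vt.get_support()]

    # 2. Remove columns that exactly duplicate an earlier column
    X = X.loc[:, ~X.T.duplicated()]

    # 3. Remove the later column of each highly correlated pair
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    X = X.drop(columns=to_drop)

    # 4. Remove columns whose estimated mutual information with the target is zero
    mi = mutual_info_classif(X, y, random_state=random_state)
    return X.loc[:, mi > 0]


# Toy data standing in for the expression matrix (hypothetical, not TCGA data)
rng = np.random.default_rng(0)
n = 120
y = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, n),
    "noise": rng.normal(0, 1, n),
    "constant": np.ones(n),
})
demo["copy"] = demo["informative"]  # exact duplicate
demo["near_copy"] = demo["informative"] * 2 + rng.normal(0, 0.01, n)  # highly correlated

kept = filter_features(demo, y)
print(list(kept.columns))
```

Note that step 4 looks at the labels, so a filter like this would typically be fitted on the training split only.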

ML modeling: Random Forest

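from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stratified 70/30 split so both classes keep their proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)

rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))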


>                   precision    recall  f1-score   support
>     
>                0       0.53      0.56      0.55        41
>                1       0.56      0.53      0.55        43
>     
>         accuracy                           0.55        84
>        macro avg       0.55      0.55      0.55        84
>     weighted avg       0.55      0.55      0.55        84
> 
> 0.5476

I did this:

rfc.score(X_train, y_train)

1.0
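
The same train-versus-held-out comparison can also be made with cross-validation instead of a single split. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the expression matrix (the sample and feature counts are arbitrary, not my real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the expression matrix (hypothetical sizes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=200, n_informative=10, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
    return_train_score=True,  # also report accuracy on the training folds
)

train_acc = scores["train_score"].mean()
val_acc = scores["test_score"].mean()
print(f"mean train accuracy:      {train_acc:.3f}")
print(f"mean validation accuracy: {val_acc:.3f}")
print(f"gap:                      {train_acc - val_acc:.3f}")
```

The gap between the two means gives a less split-dependent picture of how much the model overfits than one `score` call on the training set.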

I can see that my model is overfitting, so I then applied GridSearchCV.

param_dict = dict(n_estimators = estimators,
                  max_depth = max_depth,
                  min_samples_split = samples_split,
                  # min_samples_leaf = samples_leaf,
                  criterion = criterion)

gv = GridSearchCV(rfc, 
                  param_dict,
                  cv = 3, 
                  verbose = 1, 
                  n_jobs = -1)


best_params = gv.fit(X_train, y_train)

best_params.best_params_

{'criterion': 'entropy',
 'max_depth': 5,
 'min_samples_split': 2,
 'n_estimators': 400}
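
Since the grid values themselves are not shown above, here is a self-contained sketch of the same kind of search, ending with an evaluation of the refitted best model on the held-out set. The grid values and the synthetic data are hypothetical placeholders, not the ones I actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the expression matrix (hypothetical sizes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=100, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Hypothetical grid; the real search space is not shown in the post
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # refit=True (default) retrains the best model on all of X_tr

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")

# Evaluate the refitted best estimator on the untouched test split
test_acc = accuracy_score(y_te, search.predict(X_te))
print(f"test accuracy: {test_acc:.3f}")
```

Comparing `best_score_` with the test accuracy shows whether the cross-validated estimate and the held-out result agree.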

I got an accuracy of 0.53 on the test set.

I have tried different parameter settings, but the model is still overfitting.

Am I using the wrong data? Or is it not possible to do machine learning on gene expression datasets?

Any suggestions on what to do differently?
