Hi,
I have a bladder cancer dataset downloaded from the TCGA database. I am building an ML classification
model to predict the bladder cancer class, using the clinical data as the target and the gene expression
data (tpm_unstranded) as features.
I downloaded the dataset with:
query_TCGA = GDCquery(
  project = "TCGA-BLCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  barcode = c("TCGA-*")
)
The data:
data_Bca.shape
(428, 4933)
Target: BlcaGrade
Preprocessing:
1. Remove low-variance columns
2. Remove columns with similar values
3. Remove highly correlated columns
4. Use mutual information to remove columns that carry no information
After preprocessing I was left with 1000 features.
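For reference, here is a minimal sketch of how the variance and mutual-information steps above might be expressed with scikit-learn. It is not the exact preprocessing used for this dataset: the data is synthetic (`X_demo`, `y_demo`), and the threshold and `k` are hypothetical values chosen only for illustration. Putting the selection steps inside a `Pipeline` is one way to make sure the target-dependent step (mutual information) is fitted only on the training folds during cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a (samples x genes) expression matrix; not the TCGA data.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=200, n_informative=10, random_state=0
)


def mi_score(X, y):
    # Fixed seed so the mutual information estimate is reproducible.
    return mutual_info_classif(X, y, random_state=0)


pipe = Pipeline([
    # Drops constant columns (threshold is a hypothetical value).
    ("variance", VarianceThreshold(threshold=0.0)),
    # Keeps the 50 columns with the highest mutual information (k is hypothetical).
    ("mutual_info", SelectKBest(score_func=mi_score, k=50)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Each fold refits the selection steps on that fold's training part only.
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=3)
print(cv_scores.mean())
```

The correlation filter is left out of the sketch because scikit-learn has no built-in transformer for it; it would need a custom step.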
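ML modeling with Random Forest:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stratified 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))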
>               precision    recall  f1-score   support
>
>            0       0.53      0.56      0.55        41
>            1       0.56      0.53      0.55        43
>
>     accuracy                           0.55        84
>    macro avg       0.55      0.55      0.55        84
> weighted avg       0.55      0.55      0.55        84
>
> 0.5476
I then checked the accuracy on the training set:
rfc.score(X_train, y_train)
1.0
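The same train-versus-held-out comparison can also be made with cross-validation, which gives several estimates instead of a single split. The following is only a sketch on synthetic data (`X_demo`, `y_demo` are stand-ins, not the TCGA matrix), using `cross_validate` with `return_train_score=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data with many more features than informative ones.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=200, n_informative=5, random_state=0
)

results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo,
    y_demo,
    cv=3,
    return_train_score=True,  # also report accuracy on each training fold
)

train_mean = results["train_score"].mean()
val_mean = results["test_score"].mean()
print(f"train={train_mean:.2f}  validation={val_mean:.2f}  gap={train_mean - val_mean:.2f}")
```

A random forest with the default unlimited tree depth commonly reaches a training accuracy at or near 1.0, so the size of the gap to the validation score is usually the more informative number.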
I can see that my model is overfitting, so I applied GridSearchCV:
from sklearn.model_selection import GridSearchCV

# estimators, max_depth, samples_split and criterion are lists of candidate values
param_dict = dict(n_estimators=estimators,
                  max_depth=max_depth,
                  min_samples_split=samples_split,
                  # min_samples_leaf=samples_leaf,
                  criterion=criterion)
gv = GridSearchCV(rfc,
                  param_dict,
                  cv=3,
                  verbose=1,
                  n_jobs=-1)
best_params = gv.fit(X_train, y_train)  # fit() returns the fitted search object
best_params.best_params_
{'criterion': 'entropy',
 'max_depth': 5,
 'min_samples_split': 2,
 'n_estimators': 400}
With these parameters I got an accuracy of 0.53 on the test set.
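For reference, a minimal sketch of how a tuned model can be scored on the held-out test set, next to its cross-validated score. The data is synthetic and `demo_grid` holds hypothetical values, not the grid used above; it only illustrates the `GridSearchCV` attributes involved.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the TCGA matrix.
X_demo, y_demo = make_classification(n_samples=120, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Hypothetical, deliberately small grid to keep the example fast.
demo_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

search = GridSearchCV(RandomForestClassifier(random_state=0), demo_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best setting

# score() uses the best estimator, refitted on all of X_tr, on the test set.
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```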
I have tried different hyperparameter settings, but the model is still overfitting.
Am I using the wrong data? Or is it not possible to do machine learning on gene expression datasets?
Any suggestions on what to do differently?