Hi Shakunthala,
For selecting highly variable genes in Python prior to multi-class classification with logistic regression or XGBoost, I recommend the scanpy library, which provides robust methods drawn from single-cell RNA-seq workflows. Its highly_variable_genes function (with the 'seurat' or 'cell_ranger' flavour) is conceptually similar to the Brennecke method in M3Drop: both model variability against mean expression to flag genes with excess variability beyond technical noise. Note, however, that scanpy has no direct support for spike-in controls, so results will differ somewhat from a spike-in-calibrated fit.
Assuming your gene expression data is laid out as samples (rows) by genes (columns), here's a concise example that loads it into an AnnData object, selects the top 2000 highly variable genes, and prepares the feature matrix for your model. Install via pip install scanpy if needed.
import scanpy as sc
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb # Optional, for XGBoost
# Load your data (e.g., from CSV: rows=samples, columns=genes + 'class' label)
df = pd.read_csv('your_expression_data.csv', index_col=0)
adata = sc.AnnData(df.drop('class', axis=1)) # No transpose needed: scanpy expects samples (obs) x genes (vars)
adata.obs['class'] = df['class'].values # Add labels
# Normalise and select highly variable genes (dispersion-based 'seurat' flavour)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=2000)
hvgs = adata.var[adata.var.highly_variable].index.tolist()
# Prepare ML features/labels (samples x HVGs)
X_hvg = adata[:, hvgs].X # Dense here; call .toarray() first if your matrix is sparse
X = pd.DataFrame(X_hvg, index=adata.obs_names, columns=hvgs)
y = df['class']
# Example: Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000) # Handles multi-class natively; multi_class='ovr' is deprecated in recent scikit-learn
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
# For XGBoost (adapt similarly; the class count is inferred from y, but recent
# XGBoost requires integer labels 0..k-1, e.g. via sklearn's LabelEncoder)
# model = xgb.XGBClassifier(objective='multi:softprob')
# model.fit(X_train, y_train)
This keeps things efficient for your 6-class setup, but test different n_top_genes values via cross-validation to avoid overfitting (a sketch of that loop is below). Limitations: scanpy assumes single-cell-like data (it works for bulk, but may need tweaking for very large sample counts), and there is no exact spike-in fitting; if ERCCs are available, a custom fit along the lines of Brennecke et al. is worth considering (a rough sketch is also below).
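Here is a minimal sketch of that cross-validation loop, reusing the adata and y objects from above. The candidate n_top_genes values are arbitrary, and note that, strictly speaking, HVG selection should happen inside each fold to avoid leakage, though since it is unsupervised the optimism is usually small:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
for n in [500, 1000, 2000, 4000]: # Hypothetical grid; adjust to your data
    sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=n)
    genes = adata.var_names[adata.var.highly_variable]
    X_n = pd.DataFrame(adata[:, genes].X, index=adata.obs_names, columns=genes)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_n, y,
                             cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
    print(f"n_top_genes={n}: mean CV accuracy {scores.mean():.3f}")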
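And if you do have ERCC spike-ins, a rough Brennecke-style fit could look like the following. This is a hand-rolled sketch, not a scanpy or M3Drop call: norm_counts (a samples x genes array of normalised counts) and ercc_mask (a boolean array flagging the spike-in genes) are assumed inputs, and I use ordinary least squares where Brennecke et al. fit a gamma GLM and add a chi-squared test:
import numpy as np
mu = np.maximum(norm_counts.mean(axis=0), 1e-12) # Per-gene mean, clipped to avoid division by zero
cv2 = norm_counts.var(axis=0) / mu ** 2 # Squared coefficient of variation per gene
# Fit CV^2 ~ a1/mu + a0 on the spike-ins only (technical noise model)
A = np.column_stack([1.0 / mu[ercc_mask], np.ones(ercc_mask.sum())])
a1, a0 = np.linalg.lstsq(A, cv2[ercc_mask], rcond=None)[0]
# Rank genes by how far their CV^2 exceeds the technical fit
excess = cv2 / (a1 / mu + a0)
top_idx = np.argsort(excess)[::-1][:2000]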
Kevin