Hi Shakunthala,
For selecting highly variable genes in Python prior to multi-class classification with logistic regression or XGBoost, I recommend the scanpy library, which provides robust methods drawn from single-cell RNA-seq workflows. Its highly_variable_genes function (with the 'seurat' or 'cell_ranger' flavour) is conceptually similar to the Brennecke method in M3Drop: both model variability against mean expression to flag genes with excess variability beyond technical noise. Note, however, that scanpy has no direct support for spike-in controls, so results will differ somewhat from a spike-in-calibrated fit.
Assuming your gene expression data is laid out as samples (rows) by genes (columns), here's a concise example that loads it into an AnnData object, selects the top 2000 highly variable genes, and prepares the feature matrix for your model. Install via pip install scanpy if needed.
import scanpy as sc
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb # Optional, for XGBoost
# Load your data (e.g., from CSV: rows=samples, columns=genes + 'class' label)
df = pd.read_csv('your_expression_data.csv', index_col=0)
adata = sc.AnnData(df.drop('class', axis=1)) # No transpose needed: scanpy expects samples (obs) x genes (vars)
adata.obs['class'] = df['class'].values # Add labels
# Normalise and select highly variable genes (dispersion-based 'seurat' flavour)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=2000)
hvgs = adata.var[adata.var.highly_variable].index.tolist()
# Prepare ML features/labels (samples x HVGs)
X_hvg = adata[:, hvgs].X # Dense here; call .toarray() first if your matrix is sparse
X = pd.DataFrame(X_hvg, index=adata.obs_names, columns=hvgs)
y = df['class']
# Example: Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000) # Handles multi-class natively; multi_class='ovr' is deprecated in recent scikit-learn
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
# For XGBoost (adapt similarly; the class count is inferred from y, but recent
# XGBoost requires integer labels 0..k-1, e.g. via sklearn's LabelEncoder)
# model = xgb.XGBClassifier(objective='multi:softprob')
# model.fit(X_train, y_train)
This keeps things efficient for your 6-class setup, but test different n_top_genes values via cross-validation to avoid overfitting (a sketch of that loop is below). Limitations: scanpy assumes single-cell-like data (it works for bulk, but may need tweaking for very large sample counts), and there is no exact spike-in fitting; if ERCCs are available, a custom fit along the lines of Brennecke et al. is worth considering (a rough sketch is also below).
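Here is a minimal sketch of that cross-validation loop, reusing the adata and y objects from above. The candidate n_top_genes values are arbitrary, and note that, strictly speaking, HVG selection should happen inside each fold to avoid leakage, though since it is unsupervised the optimism is usually small:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
for n in [500, 1000, 2000, 4000]: # Hypothetical grid; adjust to your data
    sc.pp.highly_variable_genes(adata, flavor='seurat', n_top_genes=n)
    genes = adata.var_names[adata.var.highly_variable]
    X_n = pd.DataFrame(adata[:, genes].X, index=adata.obs_names, columns=genes)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_n, y,
                             cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
    print(f"n_top_genes={n}: mean CV accuracy {scores.mean():.3f}")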
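And if you do have ERCC spike-ins, a rough Brennecke-style fit could look like the following. This is a hand-rolled sketch, not a scanpy or M3Drop call: norm_counts (a samples x genes array of normalised counts) and ercc_mask (a boolean array flagging the spike-in genes) are assumed inputs, and I use ordinary least squares where Brennecke et al. fit a gamma GLM and add a chi-squared test:
import numpy as np
mu = np.maximum(norm_counts.mean(axis=0), 1e-12) # Per-gene mean, clipped to avoid division by zero
cv2 = norm_counts.var(axis=0) / mu ** 2 # Squared coefficient of variation per gene
# Fit CV^2 ~ a1/mu + a0 on the spike-ins only (technical noise model)
A = np.column_stack([1.0 / mu[ercc_mask], np.ones(ercc_mask.sum())])
a1, a0 = np.linalg.lstsq(A, cv2[ercc_mask], rcond=None)[0]
# Rank genes by how far their CV^2 exceeds the technical fit
excess = cv2 / (a1 / mu + a0)
top_idx = np.argsort(excess)[::-1][:2000]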
Kevin