Data Preparation for single-cell Machine Learning classification (svm + RF)

3.4 years ago

fracarb8 ★ 1.7k

I am working with single-cell RNA-Seq data, and I am trying to build a classifier capable of predicting whether samples are controls or patients.

I have a training dataset of around 450,000 cells from ~50 samples across different projects, each project containing both control and patient data. The idea is to train a classifier on the 50-sample dataset and predict the status of new patients as they come in.

My question is: How do I pre-process the data of the new patients?

The reason I am confused is that the training data are all normalised and scaled together. During integration, I account for the different origins of the samples by regressing out factors like sampleID, project, experiment chemistry, and so on. This does not happen for the new samples, as each is analysed independently of any other sample.

This is what I have done so far:

1. Integrate the ~50 samples with Seurat (SCT + rPCA).
2. Run NormalizeData and ScaleData(..., vars.to.regress = c("percent.mt", "SampleID", ...)) on the RNA assay.
3. Extract the scale.data slot from the RNA assay.
4. Select the genes used to train the model by combining the variable genes (FindVariableFeatures) with the results of a standard feature selection algorithm (e.g. Boruta).
5. Train/test:
   - Split the data (80/20).
   - Train an ensemble classifier (svm + rf).
   - Predict on the test data.
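To make the split step concrete: the list above does not say whether the 80/20 split is made per cell or per sample. The sketch below assumes a per-sample split, so that cells from one sample never end up on both sides; this is an assumption for illustration, not a description of what was actually run. The function name `split_by_sample` and the toy sample IDs are hypothetical.

```python
import random

def split_by_sample(cell_sample_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) cell indices, holding out whole samples.

    cell_sample_ids: one sample ID per cell, in the same order as the
    rows of the expression matrix (hypothetical input format).
    """
    samples = sorted(set(cell_sample_ids))
    rng = random.Random(seed)
    rng.shuffle(samples)
    # Hold out roughly test_fraction of the samples, at least one.
    n_test = max(1, round(len(samples) * test_fraction))
    test_samples = set(samples[:n_test])
    train_idx = [i for i, s in enumerate(cell_sample_ids) if s not in test_samples]
    test_idx = [i for i, s in enumerate(cell_sample_ids) if s in test_samples]
    return train_idx, test_idx

# Toy data: 10 samples with 5 cells each.
cell_sample_ids = [f"S{j}" for j in range(10) for _ in range(5)]
train_idx, test_idx = split_by_sample(cell_sample_ids)

train_samples = {cell_sample_ids[i] for i in train_idx}
test_samples = {cell_sample_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_samples & test_samples)
```

With a per-cell split, cells from the same sample can appear in both training and test sets, which may make test accuracy look better than it would be for a genuinely new patient.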

I still need to tweak and improve the model, but so far I can reach good prediction accuracy on the test dataset.

When a new patient arrives, I am planning on:

1. Analyse the data with Seurat.
2. Extract the scale.data slot from the RNA assay.
3. Predict.

Is it correct to feed the Seurat-normalised and scaled data to the model for prediction?

Would it be better to ditch Seurat entirely, start from the raw counts, and normalise and scale the data in the same way for both the training samples and the new patients?

Would it be even better to integrate the new samples with the original dataset and predict on the globally normalised dataset?
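As a rough illustration of what "normalise and scale in the same way" could mean, here is a minimal pure-Python sketch, assuming per-cell log-normalisation of raw counts followed by per-gene centring and scaling, with the gene means and standard deviations estimated on the training cells only and then reused unchanged for a new sample. The function names and toy counts are hypothetical, and the sketch does not regress out any covariates.

```python
import math

def log_normalise(counts, scale_factor=10_000):
    """Per cell: log1p(count / total_counts * scale_factor)."""
    out = []
    for cell in counts:
        total = sum(cell)
        out.append([math.log1p(c / total * scale_factor) if total else 0.0
                    for c in cell])
    return out

def fit_gene_scaling(norm):
    """Estimate per-gene mean and standard deviation on training cells."""
    n, n_genes = len(norm), len(norm[0])
    means = [sum(cell[g] for cell in norm) / n for g in range(n_genes)]
    sds = []
    for g in range(n_genes):
        var = sum((cell[g] - means[g]) ** 2 for cell in norm) / (n - 1)
        sds.append(math.sqrt(var) or 1.0)  # avoid dividing by zero
    return means, sds

def apply_gene_scaling(norm, means, sds):
    """Centre and scale using previously fitted (frozen) parameters."""
    return [[(x - m) / s for x, m, s in zip(cell, means, sds)] for cell in norm]

# Toy raw counts: rows are cells, columns are genes.
train_counts = [[10, 0, 5], [3, 2, 1], [0, 8, 4], [6, 6, 6]]
new_counts = [[2, 1, 7]]

train_norm = log_normalise(train_counts)
means, sds = fit_gene_scaling(train_norm)
train_scaled = apply_gene_scaling(train_norm, means, sds)

# The new patient reuses the training means/sds instead of its own.
new_scaled = apply_gene_scaling(log_normalise(new_counts), means, sds)
```

The point of the sketch is only that the new sample is never used to estimate the scaling parameters, so its features live on the same scale the classifier saw during training.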

seurat machine-learning scRNA-Seq • 578 views
