Which normalization method/pipeline is best to generate normalized counts to make different single-cell experiments comparable?
1
0
Entering edit mode
17 months ago
amjass ▴ 20

Hi,

I have raw counts from multiple different single cell RNAseq experiments from different sources (different sequencing technologies etc). I need to generate a matrix of normalized counts for every experiment such that they are relatively similar to one another for a downstream ML exercise.

What is the recommended way/tools to use to normalize data like this? One could argue the datasets and even individual cells are not directly being compared to one another as they will be classified, but I need a reasonable level of normalization across datasets generated from different technologies!

thank you

single cell normalization
0
Entering edit mode

Do the different datasets contain roughly the same celltypes or is this rather a mishmash of experiments?

0
Entering edit mode

its a bit of both, some experiments will contain similar cell types and others certainly contain a mish-mash...

0
Entering edit mode

I think you will have a hard time trying to tweak the data for your purposes. Single-cell assays can suffer from severe batch effects between experiments. It is possible to integrate assays with tools such as fastMNN, but this usually results in corrected values in PCA space, which can then be used to generate a unified clustering landscape. Both the PCA-space values and the corrected "counts", if you will, are not recommended for anything but visualization, as the integration procedure creates dependencies between the data and can even change the direction of effects or produce negative values.

I was asking about the data composition because one might try to regress out the different experiments to get corrected counts, but this probably only makes sense if the data between experiments are actually the same and only the "study" factor is confounding.

Sure, you can run any of the standard normalization techniques on your datasets, be it TPM or more elaborate, single-cell-specific ones such as the deconvolution method from scran, or model-based normalizations such as the sctransform variance-stabilizing transformation, but the strong confounding will remain. There is a good chance that your results will suffer from the batch effects. It is often not possible to simply collect unrelated experiments and pretend they can be pressed into a meaningful combined analysis as if confounding were not present. Would it be an option to run whatever analysis you want on the individual datasets and then combine the results afterwards?
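As a concrete baseline, the simplest of the "standard normalization techniques" mentioned above (per-cell counts-per-million followed by a log transform) can be sketched in a few lines. This is a plain-Python illustration of the idea, not the implementation of any particular package:

```python
import math

def cpm_log1p(cell_counts):
    """Counts-per-million depth normalization followed by log(1 + x).
    Corrects only for sequencing depth, not for composition or batch."""
    depth = sum(cell_counts)
    return [math.log1p(c / depth * 1e6) for c in cell_counts]

# Two cells with the same relative expression but different depths
# land on the same scale after CPM.
cell_a = [10, 20, 70]     # total depth 100
cell_b = [100, 200, 700]  # total depth 1000
print(cpm_log1p(cell_a) == cpm_log1p(cell_b))  # True
```

Note that this only removes per-cell depth differences; the study-level confounding discussed above is untouched.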

0
Entering edit mode

thank you for the detailed response! all of this makes perfect sense.

Just to answer your last question - no, the data need to be combined, as they will be used to train a classifier that requires all of it. In the absence of any established way of doing this, and since the comment below suggests something similar, I will opt for TPM normalisation (or VST normalisation, as this is something I am currently using for another project, although there all batches contain relatively similar samples). I will likely compare both methods and see how they affect downstream results.

I am wondering how confounding batch will be if proportionally high transcripts are commonly high across the different cells - the data will ultimately be log transformed and then scaled to a smaller range. This is why I am more inclined to use a VST: to avoid outlier values that could mess with the squashing of values later on!

0
Entering edit mode

The vst is intended for UMI data though, so you would need to check whether all platforms produce UMI counts. Personally, I would use the method in scran, as it corrects for compositional changes, which you will almost certainly have: different single-cell platforms capture and process transcripts differently (end-tagged vs full-length), so in full-length data longer transcripts will have inherently higher counts than shorter ones at equal expression levels, and plate-based technologies generally have higher depth per cell but fewer cells overall than droplet-based ones. I doubt that something as simple as TPM will do a good job; to my knowledge, no benchmark (for bulk or single-cell) has ever explicitly recommended a simple per-million technique that corrects only for depth rather than composition. But as said, best to try and compare.
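The composition problem with simple per-million scaling can be shown with a toy example (illustrative numbers only): genes B and C are truly unchanged between two cells, but cell 2 additionally expresses gene A very highly, so per-million scaling deflates B and C in cell 2 even though they did not change.

```python
def cpm(cell_counts):
    """Per-million scaling: corrects for depth only, not composition."""
    depth = sum(cell_counts)
    return [c / depth * 1e6 for c in cell_counts]

# Genes: [A, B, C]. B and C are identical between cells;
# cell 2 additionally has a very highly expressed gene A.
cell_1 = [0, 100, 100]
cell_2 = [800, 100, 100]

print(cpm(cell_1))  # B and C at 500000 each
print(cpm(cell_2))  # B and C deflated to 100000 each
```

Composition-aware size factors (as in scran's deconvolution approach) are designed to avoid exactly this distortion.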

0
Entering edit mode

thank you, I will do this - if all are UMI, VST is viable and the one I would probably prefer (but I will still compare it to other approaches)

I am not familiar with the scran approach so will read the documentation now! May I ask what the specific method in scran is (as in the function, so I can read up on it fully and see how it works)? Presumably I will get back a matrix of normalised counts?

Thank you for the comments re: TPM - I have read similar comments elsewhere, which is what prompted this question, to see whether there were better/established ways.

2
Entering edit mode

It is an awesome read for basically everything related to scRNA-seq with regard to the Bioconductor universe.

0
Entering edit mode
17 months ago
Mensur Dlakic ★ 20k

It seems like you want to normalize the data so that the total number of transcripts is the same across all datasets. There are many ways of doing that, and I recommend transcripts per million (TPM) normalization. See here for how to do it directly if you wish. Alternatively, a program like kallisto will do all the steps for you, assuming you have the reads for all your technologies; this includes normalization by effective rather than raw transcript length.
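For reference, TPM can be computed directly from raw counts and transcript lengths: divide each gene's count by its length, then scale so the per-sample rates sum to one million. A minimal sketch (the gene lengths here are illustrative placeholders, and raw lengths are used rather than kallisto's effective lengths):

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: divide counts by transcript length,
    then scale so the length-normalized rates sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Toy example: three genes in one cell, with made-up lengths in kb.
# Counts are proportional to length, so the TPM values come out equal.
counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 3.0]
print(tpm(counts, lengths_kb))
```

By construction, TPM values always sum to one million per cell, which is what makes cells from different experiments directly comparable in total.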

0
Entering edit mode

thank you - and can TPM normalization be applied to samples individually and then merged? Presumably yes, as it's simply TPM computed per cell?
