Question

best transformation for large RNAseq dataset


3.1 years ago

loipf • 0

Hi all,

I would like to analyze a very large number of RNA-seq samples from ARCHS4, probably around 100 000 samples x 60 000 genes, all from different experiments, setups and conditions. After filtering out bad-quality samples, what is the best way to transform my data for further statistical analysis (such as PCA, not DE) that expects a Gaussian distribution? (The method used later on takes care of batch effect correction.)

Results from my search so far:

There is no right or wrong solution; it always depends on the data, and there are many comparative studies out there that reach opposite conclusions :/
Quantile normalization shouldn't be applied blindly and isn't straightforward for datasets with multiple conditions [Zhao 2020].
There is also a study [Mancuso 2020] that transformed the data using the inverse hyperbolic sine (arcsinh) function. What is its advantage over other methods here?
I personally would use DESeq2 to normalize by size factors and then apply its variance stabilizing transformation to get more homoscedastic data. This would be my input for later analysis. Is this allowed, or do I violate some statistical requirements? Can I go wrong with this approach?
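On the arcsinh point, a minimal sketch may help to see how the inverse hyperbolic sine compares with a shifted log. It is only an illustration on made-up toy counts and does not reproduce the pipeline of [Mancuso 2020]; the function names `log2p1` and `arcsinh` are hypothetical helpers. The point it shows: arcsinh(x) = ln(x + sqrt(x^2 + 1)) is defined at zero without a pseudocount, is roughly linear for small counts, and approaches ln(2x) for large counts.

```python
import numpy as np

def log2p1(counts):
    """Shifted log transform: log2(count + 1). The +1 pseudocount is a choice."""
    return np.log2(np.asarray(counts, dtype=float) + 1.0)

def arcsinh(counts):
    """Inverse hyperbolic sine: ln(x + sqrt(x^2 + 1)). No pseudocount needed."""
    return np.arcsinh(np.asarray(counts, dtype=float))

# Toy counts (made up for illustration only).
counts = np.array([0, 1, 10, 1_000, 100_000])

print("arcsinh :", arcsinh(counts))
print("log2p1  :", log2p1(counts))
# For large counts, arcsinh(x) is close to ln(2x), i.e. a log transform
# shifted by a constant.
print("ln(2x)  :", np.log(2 * counts[1:]))
```

Both transforms compress large counts in a log-like way, so the practical difference mostly concerns how zeros and very low counts are treated.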

Thank you in advance

RNA-Seq DESeq2 normalization transformation VST • 787 views



With that many samples, why not just normalize for depth? I have a hard time believing that, with this excessive number of samples and all the batch effects that come with it, you gain anything from a more elaborate normalization. Normalization is meant to eliminate technical biases within the same experiment. You seem to have collected more or less all datasets from GEO and now assume that they can be meaningfully combined. I really doubt that, no matter how you normalize.
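For reference, plain depth normalization can be sketched as log counts-per-million. This is a minimal illustration, assuming a genes x samples matrix of raw counts; the function name `log_cpm`, the pseudocount of 1 and the toy matrix are assumptions made here, not something taken from ARCHS4 or any specific package.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """Depth-normalize a genes x samples count matrix to log2(CPM + pseudocount).

    Assumes every sample has a nonzero total count (empty samples should be
    filtered out beforehand, otherwise this divides by zero).
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)        # total counts per sample
    cpm = counts / library_sizes * 1e6        # counts per million
    return np.log2(cpm + pseudocount)

# Toy example: sample 2 is sample 1 sequenced at twice the depth.
toy = np.array([[10, 20],
                [30, 60],
                [60, 120]])
normalized = log_cpm(toy)
print(normalized)
```

After depth normalization the two toy samples become identical, which is all this step is meant to achieve; it does nothing about composition or batch effects.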

3.1 years ago by ATpoint 82k


Accidentally responded in the main thread.

3.1 years ago by loipf • 0


If I am not wrong, DESeq2 size factors should account for sequencing depth, and the variance stabilizing transformation should produce values on a log2-like scale. This would be my input for machine learning tasks, and an autoencoder can take care of the batch effects. So the batch effects are not really the problem; the question is rather what the "best" input is, that is, what kind of pre-scaling of the samples is necessary, if any at all. But I imagine there is no best solution and every normalization has its advantages and disadvantages.
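To make the size-factor idea concrete, here is a rough NumPy sketch of the median-of-ratios approach that DESeq2's default size factors are based on. It is not DESeq2's implementation and does not cover the variance stabilizing transformation; the function name `median_of_ratios_size_factors` and the toy matrix are hypothetical. It assumes a genes x samples matrix of raw counts and uses only genes that are nonzero in every sample, which may leave very few genes in a collection as large and heterogeneous as the one described here.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Per-sample size factors via the median-of-ratios idea.

    counts: genes x samples matrix of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep genes with nonzero counts in all samples, so logs are finite.
    expressed = np.all(counts > 0, axis=1)
    if not expressed.any():
        raise ValueError("no gene has nonzero counts in every sample")
    log_counts = np.log(counts[expressed])
    # Log geometric mean of each gene across samples (pseudo-reference sample).
    log_geo_means = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median ratio of a sample to the pseudo-reference.
    return np.exp(np.median(log_counts - log_geo_means, axis=0))

# Toy example: sample 2 has twice the depth; the last gene is zero in sample 1
# and is therefore excluded from the calculation.
toy = np.array([[10, 20],
                [30, 60],
                [60, 120],
                [0, 5]])
size_factors = median_of_ratios_size_factors(toy)
normalized = toy / size_factors   # divide each sample (column) by its factor
print(size_factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; a log-like transform would then be applied on top of that.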

3.1 years ago by loipf • 0