I would like to analyze a very large number of RNA-seq samples from ARCHS4, probably around 100 000 samples x 60 000 genes, all from different experiments, setups and conditions. After filtering out low-quality samples, what is the best way to transform my data so that I can use it for further statistical analysis (e.g. PCA, not differential expression) that expects approximately Gaussian-distributed input? (The downstream method takes care of batch effect correction.)
Results from my search so far:
There seems to be no single right or wrong solution: the best choice depends on the data, and the many comparative studies out there reach conflicting conclusions :/
Quantile normalization shouldn't be applied blindly and isn't straightforward for datasets with multiple conditions [Zhao 2020].
There is also a study [Mancuso 2020] that transformed the data using the inverse hyperbolic sine (arcsinh) function. What is its advantage over other transformations?
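To make the arcsinh question concrete, here is a minimal Python sketch comparing it with a log transform. It is my own illustration, not code from [Mancuso 2020], and the toy values are made up. As far as I can tell, the practical difference is that arcsinh is defined at zero without a pseudocount, while for large values it behaves like a natural log shifted by ln(2):

```python
import numpy as np

# Toy expression values (made up for illustration), from zero to highly expressed.
counts = np.array([0.0, 1.0, 10.0, 1_000.0, 100_000.0])

# arcsinh(x) = ln(x + sqrt(x^2 + 1)); defined at x = 0, no pseudocount needed.
asinh_vals = np.arcsinh(counts)

# log1p(x) = ln(x + 1); the "+ 1" is the usual pseudocount to handle zeros.
log1p_vals = np.log1p(counts)

# For large x, arcsinh(x) approaches ln(2x) = ln(x) + ln(2).
large = counts[counts >= 1_000]
gap = np.arcsinh(large) - np.log(large)

for c, a, l in zip(counts, asinh_vals, log1p_vals):
    print(f"{c:>10.0f}  arcsinh={a:7.3f}  log1p={l:7.3f}")
print("arcsinh(x) - ln(x) for large x:", gap, " ln(2) =", np.log(2))
```

So on highly expressed genes the two transformations differ only by a constant offset; they mainly differ in how they treat zeros and very low values.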
Personally, I would use DESeq2 to normalize by size factors and then apply its variance stabilizing transformation to get more homoscedastic data. This would be my input for the later analysis. Is this a valid approach, or would it violate some statistical requirements? Can I go wrong with this approach?
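For clarity, this is what I mean by size-factor normalization: a minimal NumPy sketch of the median-of-ratios idea. It is an illustration only, not DESeq2's actual implementation; the function name `median_of_ratios_size_factors` and the toy matrix are made up, and the variance stabilizing transformation itself is not reproduced here.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from a genes x samples count matrix.

    Hypothetical helper sketching the median-of-ratios idea; not DESeq2 code.
    """
    counts = np.asarray(counts, dtype=float)
    # Only genes with a nonzero count in every sample have a finite log geometric mean.
    expressed = np.all(counts > 0, axis=1)
    if not expressed.any():
        raise ValueError("no gene is nonzero in all samples")
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median over genes of the ratio (sample count / reference).
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data (made up): samples 2 and 3 are the first sample sequenced 2x and 4x deeper.
base = np.array([10, 20, 30, 0, 500])
counts = np.column_stack([base, 2 * base, 4 * base])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # divide each sample (column) by its size factor

print(size_factors)
print(normalized)
```

In this toy case the normalized columns come out identical, because the samples differ only in depth. One thing I am unsure about is whether enough genes are nonzero in all ~100 000 samples for a reference built this way to be usable.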
Thank you in advance