I have a problem with PCA: I don't understand when scaling is recommended, or how to identify the loadings for PC1 (that is, the genes that in theory contribute most to the variance and so, if I have understood correctly, are responsible for separating my samples along PC1). I only have RPKM/FPKM datasets (bulk RNA-seq), so unfortunately I cannot perform the PCA on raw data.
Is it correct to perform PCA on normalized data (RPKM/FPKM)?
Assuming it is correct, about the log transformation prior to the PCA: is it correct to add a pseudocount, i.e. log2(n + 1), or should I only log transform? (I followed this for my code: https://jackauty.com/pca-and-3d-pca/)
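To make the pseudocount question concrete, here is a minimal sketch in Python/NumPy of the two options. The FPKM values are made up for illustration (they are not from my data), and the pseudocount of 1 is simply the value mentioned above, not a recommendation.

```python
import numpy as np

# Hypothetical FPKM values (rows = genes, columns = samples); not real data.
fpkm = np.array([[0.0, 1.0, 3.0],
                 [7.0, 15.0, 1023.0]])

# Option 1: log2(n + 1). Zeros map to 0 and every entry stays finite.
log_pseudo = np.log2(fpkm + 1.0)

# Option 2: plain log2(n). Any zero becomes -inf, which PCA cannot use.
with np.errstate(divide="ignore"):
    log_plain = np.log2(fpkm)

print(log_pseudo)
print("non-finite entries without pseudocount:", np.isinf(log_plain).sum())
```

The only point of the sketch is that a plain log transform breaks as soon as a gene has zero expression in any sample, whereas the pseudocount keeps the matrix finite.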
I previously asked how to scale bulk RNA-seq data and, following your suggestions, I should set scale=F, since I have already normalized the dataset prior to the PCA, so as not to lose important information (Question: PCA: log transform and scale=T/F?). But if I want to do the PCA on genes shared across multiple datasets (each dataset sequenced at a different time, from a different species, and so on), is it better to set center=T with scale=F or with scale=T? I did the log transform + pseudocount and then tried both approaches (scaling and not scaling). Am I right in thinking that scale=T would be more appropriate than scale=F here, since scaling puts all my genes on the same scale regardless of expression level? The consequence of scaling is that lowly expressed genes have the same impact as highly expressed genes. If we analyzed each dataset separately, I would agree that genes with higher expression levels are more relevant overall in shaping cellular identity than lowly expressed genes, but I am not sure I would be able to see how similar or different the tissues from different species are. (I don't know if this makes sense...)
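As an illustration of what the scale=T versus scale=F choice does, here is a small Python/NumPy sketch. The function `pca` is a hypothetical helper loosely modelled on prcomp-style `center`/`scale` options (samples in rows, genes in columns); it is not the code I actually ran, and the two-gene matrix is invented so that one gene has a much larger variance than the other.

```python
import numpy as np

def pca(x, center=True, scale=False):
    """PCA via SVD on a samples x genes matrix (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)
    if scale:
        # Divide each gene by its sample standard deviation (n - 1).
        sd = x.std(axis=0, ddof=1)
        if np.any(sd == 0):
            raise ValueError("constant genes cannot be scaled; remove them first")
        x = x / sd
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u * s                      # samples x PCs
    loadings = vt.T                     # genes x PCs ("rotation")
    var_explained = s**2 / np.sum(s**2)
    return scores, loadings, var_explained

# Invented log-expression values: column 0 varies a lot, column 1 very little.
log_expr = np.array([[0.0, 1.0],
                     [2.0, 1.1],
                     [4.0, 1.0],
                     [6.0, 1.2],
                     [8.0, 1.1],
                     [10.0, 1.3]])

_, load_raw, var_raw = pca(log_expr, center=True, scale=False)
_, load_scaled, var_scaled = pca(log_expr, center=True, scale=True)

print("PC1 loadings, scale=False:", load_raw[:, 0])
print("PC1 loadings, scale=True: ", load_scaled[:, 0])
```

In this toy case the unscaled PC1 is almost entirely the high-variance gene, while after scaling both genes carry equal weight on PC1. Whether that equal weighting is desirable across species is exactly what I am unsure about.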
I compared different datasets generated at different times and from different species (I do not have the raw counts), trying two approaches. Initially I identified the genes shared across the datasets and then performed PCA/heatmap; I also followed the suggestion to convert each dataset into Z-scores, then identify the shared genes, and then do PCA/heatmap (Comparison different Datasets without having the FASTQ).

Now I want to see which genes are responsible for clustering my samples far from each other (= which genes are responsible for most of the variance). As far as I understand, I should look at the loadings/rotation, but I am confused about them. Each gene contributes differently to the different PCs, but those that contribute to separating my samples are those with the highest variance, and they are represented in PC1. So if I select the top 500 genes in PC1 and re-run the PCA on only those 500, should I see the samples cluster very far apart from each other? Conversely, if I select the genes that have less "weight" in PC1 and compute the PCA again, should my samples cluster closer together than before? And if all of this is correct, should I center and scale in my PCA or not, and is there a better way to plot the loadings? Why do all my loadings have only positive values? Is it because I log transformed the dataset with a pseudocount before running the PCA?
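To show what I mean by ranking genes on PC1, and one situation in which all PC1 loadings share the same sign, here is a hedged Python/NumPy sketch. The gene names and values are invented, and `pc1_loadings` is a hypothetical helper. The sketch relies on two general facts: the overall sign of a principal component is arbitrary, so genes are usually ranked by absolute loading, and when an all-positive matrix is not centered, the PC1 loadings all have the same sign. It does not diagnose what happened in my actual data.

```python
import numpy as np

def pc1_loadings(x, center=True):
    """Return the PC1 loading of each gene (columns of x); hypothetical helper."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

genes = np.array(["geneA", "geneB", "geneC", "geneD"])  # made-up names

# Invented log2(FPKM + 1) values, samples in rows, all strictly positive.
log_expr = np.array([[1.0, 9.0, 5.0, 3.0],
                     [2.0, 8.0, 5.1, 3.2],
                     [5.0, 5.0, 4.9, 3.9],
                     [8.0, 2.0, 5.0, 4.7],
                     [9.0, 1.0, 5.0, 5.1]])

centered = pc1_loadings(log_expr, center=True)
uncentered = pc1_loadings(log_expr, center=False)

# Rank genes by absolute PC1 loading, because the sign of a PC is arbitrary.
order = np.argsort(-np.abs(centered))
top_genes = genes[order[:2]]

print("centered PC1 loadings:  ", dict(zip(genes, np.round(centered, 3))))
print("uncentered PC1 loadings:", dict(zip(genes, np.round(uncentered, 3))))
print("top genes on centered PC1:", top_genes)
```

With centering, geneA and geneB get loadings of opposite sign because they change in opposite directions across samples. Without centering, every gene gets the same sign because PC1 mostly tracks the overall expression level.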
Apologies if these questions are stupid, and thank you for your time!