Question

PCA and loadings question

0

3.8 years ago

camillab. ▴ 160

Hi!

I have a problem with PCA: I don't understand when scaling is recommended, or how to identify the loadings for PC1 (that is, the genes that contribute most to the variance and, if I have understood correctly, are responsible for separating my samples along PC1). I only have RPKM/FPKM data (bulk RNA-seq), so unfortunately I cannot perform the PCA on raw counts.

Is it correct to perform PCA on normalized data (RPKM/FPKM)?
Assuming it is correct, what about the log transformation prior to the PCA: is it correct to add a pseudocount, i.e. log2(n + 1), or should I only log transform? (I followed this for my code: https://jackauty.com/pca-and-3d-pca/)
I previously asked how to scale bulk RNA-seq data and, following your suggestions, I should set scale=F because the dataset was already normalized before the PCA, so as not to lose important information (Question: PCA: log transform and scale=T/F?). But if I want to run the PCA on genes shared across multiple datasets (each sequenced at a different time, from different species, and so on), is it better to set center=T with scale=F or with scale=T? I applied the log transform with a pseudocount and then tried both approaches (scaling and not scaling). Am I right in thinking that scale=T would be more appropriate than scale=F, since scaling puts all my genes on the same scale regardless of expression level? The consequence of scaling is that lowly expressed genes have the same impact as highly expressed genes. If we analyzed each dataset separately, I would agree that highly expressed genes are more relevant overall in shaping cellular identity than lowly expressed genes, but I am not sure that this lets me see how similar or different tissues from different species are. (I don't know if this makes sense...)
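To make the scale=T versus scale=F trade-off described above concrete, here is a minimal sketch. It assumes the scale/center arguments refer to R's prcomp and uses NumPy as a stand-in; the `pca` helper and the toy values are hypothetical, not taken from any of the datasets in this thread.

```python
import numpy as np

def pca(x, scale=False):
    """PCA via SVD on a samples-by-genes matrix (always centered).

    scale=True additionally divides each gene by its standard deviation,
    which fails for genes with zero variance, so drop those first.
    """
    x = x - x.mean(axis=0)
    if scale:
        x = x / x.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u * s, vt.T  # scores (samples x PCs), loadings (genes x PCs)

# Toy log-expression matrix: 6 samples (rows) x 3 genes (columns).
# Gene 0 varies a lot; genes 1 and 2 vary very little.
expr = np.array([
    [2.0, 1.0, 0.5],
    [8.0, 1.1, 0.6],
    [4.0, 1.0, 0.5],
    [10.0, 1.1, 0.6],
    [6.0, 1.0, 0.5],
    [12.0, 1.1, 0.6],
])

_, load_raw = pca(expr, scale=False)
_, load_scaled = pca(expr, scale=True)

# The sign of a principal component is arbitrary, so compare absolute values.
print("PC1 loadings, unscaled:", np.round(np.abs(load_raw[:, 0]), 3))
print("PC1 loadings, scaled:  ", np.round(np.abs(load_scaled[:, 0]), 3))
```

Without scaling, PC1 is almost entirely gene 0, the gene with the largest variance. With scaling, all three genes get comparable weight, which is the "lowly expressed genes have the same impact" effect described in the question.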
I compared datasets generated at different times and from different species (I do not have the raw counts) using two approaches. Initially I identified the genes shared across the datasets and then performed the PCA/heatmap. I also followed the suggestion to convert each dataset into Z-scores, then identify the shared genes, and then run the PCA/heatmap (Comparison different Datasets without having the FASTQ). Now I want to see which genes are responsible for clustering my samples far from each other (= which genes are responsible for the most variance). As far as I understand, I should look at the loadings/rotation, but I am confused about them. Each gene contributes differently to the different PCs, but the genes that push my samples far apart are those with the highest variance, and they are represented by PC1. So if I select the top 500 genes on PC1 and re-run the PCA on only those 500, should I see the samples cluster very far apart from each other? And if I instead select the genes with the least "weight" on PC1 and compute the PCA again, should my samples cluster closer together than before? Is that correct? And if all of this is correct, should I center and scale in my PCA or not, and is there a better way to plot the loadings? Why are all my loadings positive? Is it because I log transformed the dataset with a pseudocount before running the PCA?
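The idea of selecting the top genes on PC1 and re-running the PCA can be sketched as follows. This is only an illustration on simulated data under stated assumptions: the `pca` helper, the group sizes and the shift of 4 are all made up, and NumPy stands in for whatever PCA function is actually used.

```python
import numpy as np

def pca(x):
    """Centered, unscaled PCA via SVD on a samples-by-genes matrix."""
    xc = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    # scores, loadings (genes x PCs), fraction of variance per PC
    return u * s, vt.T, s**2 / np.sum(s**2)

rng = np.random.default_rng(0)
n_per_group, n_genes, n_shifted = 4, 50, 5

# Simulated log-expression (not real data): noise everywhere, plus a shift
# of 4 in the first five genes for the second group of samples.
expr = rng.normal(loc=5.0, scale=0.5, size=(2 * n_per_group, n_genes))
expr[n_per_group:, :n_shifted] += 4.0

scores, loadings, var_frac = pca(expr)

# Rank genes by the absolute value of their PC1 loading; the sign only says
# which end of the PC1 axis a gene pulls towards.
order = np.argsort(-np.abs(loadings[:, 0]))
top, rest = order[:n_shifted], order[n_shifted:]

_, _, var_frac_top = pca(expr[:, top])
_, _, var_frac_rest = pca(expr[:, rest])

print("top genes by |PC1 loading|:", sorted(top.tolist()))
print("PC1 variance fraction, all genes: ", round(float(var_frac[0]), 2))
print("PC1 variance fraction, top genes: ", round(float(var_frac_top[0]), 2))
print("PC1 variance fraction, rest:      ", round(float(var_frac_rest[0]), 2))
```

In this simulation the top-ranked genes are exactly the shifted ones, and PC1 explains more of the variance when only they are kept and less when they are removed. Note that this outcome is partly built in: genes chosen because they separate the samples will separate the samples when the PCA is repeated, so the second PCA is not independent evidence.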

Apologies if these questions are stupid, and thank you for your time!

Camilla

bulkRNA PCA loadings scale • 1.4k views

updated 3.7 years ago by Biostar 20 • written 3.8 years ago by camillab. ▴ 160

1

Can you take a step back and try to explain what exactly the question is that you're trying to address? I.e. why are you interested in the loadings to begin with?

3.8 years ago by Friederike 8.9k

0

So assuming you have a PCA like this: https://campus.datacamp.com/courses/chip-seq-with-bioconductor-in-r/comparing-chip-seq-samples?ex=2 you can see that the pink samples (= TURP) do not cluster with the purple ones (= primary). If I want to see which genes keep these samples from clustering together, I should look at the loadings of PC1, which represent the genes that contribute most to the variance, i.e. the genes that differ most across the 4 samples and are therefore responsible for the samples not clustering together. Am I right?

So I tried to check the loadings, but I got only positive values, which I believe might be because I applied log2(n + 1) to the dataset before computing the PCA. But (1) I am not sure log2(n + 1) is the right way to log transform, and (2) I don't know why I get only positive values or how to interpret them.
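On point 1, the difference between log2(n + 1) and a plain log2(n) is easiest to see on values that include a zero. A small sketch with made-up FPKM values (hypothetical, not from any dataset discussed here):

```python
import numpy as np

# Hypothetical FPKM values for one gene across five samples.
fpkm = np.array([0.0, 0.5, 1.0, 10.0, 1000.0])

# log2(x + 1): a zero stays finite and maps to exactly 0.
log_pseudo = np.log2(fpkm + 1.0)

# Plain log2(x): a zero maps to -inf, which then breaks centering and the
# PCA itself unless such genes are filtered out first.
with np.errstate(divide="ignore"):
    log_plain = np.log2(fpkm)

print(log_pseudo)
print(log_plain)
```

Because FPKM values are never negative, log2(n + 1) values are never negative either. That alone does not force the loadings to be positive, since loadings are computed after centering.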

3.8 years ago by camillab. ▴ 160

0

The typical approach with RNA-seq data would be to do differential gene expression analysis using sophisticated t-tests on data that has been optimally normalized where the normalization would be taken care of by the package, e.g. DESeq2 or edgeR. Of course, you will need raw counts for that. What is the reason you only have the RPKM values?

To address your concern about positive loadings only -- I'm not sure why you are concerned about that. It just means that those top loadings are mostly positively correlated.
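One possible explanation for PC1 loadings that all share a sign, offered as a sketch rather than a diagnosis of this particular analysis: if a non-negative matrix (such as log2(n + 1) values) goes into the decomposition without centering, the first component mostly points towards the overall mean, and all its loadings have the same sign. The toy matrix below is hypothetical, and NumPy's SVD stands in for the PCA function actually used.

```python
import numpy as np

# Toy log2(n + 1) matrix: 4 samples (rows) x 4 genes (columns), all positive.
# Gene 0 is high in the first two samples, gene 1 in the last two.
log_expr = np.array([
    [6.0, 2.0, 4.0, 3.0],
    [6.2, 1.8, 4.1, 3.0],
    [2.0, 6.0, 4.0, 3.1],
    [1.9, 6.1, 3.9, 3.0],
])

# Without centering: the first right singular vector of a positive matrix
# has entries that all share one sign.
_, _, vt_raw = np.linalg.svd(log_expr, full_matrices=False)
pc1_uncentered = vt_raw[0]

# With centering (each gene minus its mean across samples): genes that move
# in opposite directions get opposite signs on PC1.
centered = log_expr - log_expr.mean(axis=0)
_, _, vt_centered = np.linalg.svd(centered, full_matrices=False)
pc1_centered = vt_centered[0]

print("PC1 without centering:", np.round(pc1_uncentered, 3))
print("PC1 with centering:   ", np.round(pc1_centered, 3))
```

Another possibility worth checking, again only a guess: R's prcomp expects samples in rows and variables (genes) in columns, so if the matrix is passed with genes in rows, the rotation describes the samples rather than the genes.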

3.8 years ago by Friederike 8.9k

0

I have no negative loadings at all. Is it normal?

The datasets downloaded from NCBI GEO contain only normalised reads, not raw counts.

3.8 years ago by camillab. ▴ 160

1

"I have no negative loadings at all. Is it normal?"

How many PCs have you looked at?
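The number of PCs matters because principal components are orthogonal to each other: if every PC1 loading is positive, any other PC must contain both positive and negative loadings, otherwise its dot product with PC1 could not be zero. A minimal check on random positive data (simulated, with NumPy's SVD as a stand-in for the actual PCA call):

```python
import numpy as np

rng = np.random.default_rng(1)

# Strictly positive, uncentered toy matrix: 6 samples x 5 genes.
x = rng.uniform(1.0, 10.0, size=(6, 5))

_, _, vt = np.linalg.svd(x, full_matrices=False)
pc1, pc2 = vt[0], vt[1]

# The overall sign of a component is arbitrary; flip PC1 so it is positive.
if pc1.sum() < 0:
    pc1 = -pc1

print("PC1 loadings:", np.round(pc1, 3))
print("PC2 loadings:", np.round(pc2, 3))
print("PC1 . PC2 =", float(np.round(pc1 @ pc2, 12)))
```

So loadings that are positive on every PC would point to something other than the loadings being inspected, whereas all-positive loadings on PC1 alone are mathematically possible.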

"The datasets downloaded from NCBI GEO contain only normalised reads, not raw counts."

If this is data from GEO, you could download the FASTQ files and do the alignment and raw read counting yourself. If you want to draw conclusions from this analysis that are meant to guide future research, I would strongly recommend doing that.

3.8 years ago by Friederike 8.9k