Question: Scale and Center [normalized] RNA-seq expression counts for PCA?
gaelgarcia (UK) wrote, 4.2 years ago:

I have a dataset of hundreds of different samples and their normalized (by library size) feature counts.

I want to perform downstream analysis on these data, starting with PCA using `prcomp`. Should I center and scale the values before PCA, or is the normalization of reads enough?

Thanks!

sequencing rna-seq pca R genome • 6.1k views
modified 8 months ago by Kevin Blighe • written 4.2 years ago by gaelgarcia

You can use log-transformed values, adding a pseudocount first so that zero counts stay defined, to reduce the bias towards highly expressed transcripts. Higher expression values always dominate the variation between samples compared with lower values (less expressed transcripts). Without this transformation, the results will be driven almost entirely by highly expressed transcripts. This matters most when you analyse lowly expressed transcripts, especially non-coding RNAs alongside protein-coding ones.

written 4.2 years ago by EagleEye
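
To make the comment above concrete, here is a minimal R sketch. The toy matrix, the gene names and the pseudocount of 1 are all illustrative assumptions, not values from this thread. It compares each gene's share of the total variance before and after `log2(x + 1)`, for three genes that have similar fold changes but very different expression levels.

```r
# Hypothetical library-size-normalized counts: 6 samples (rows) x 3 genes (columns).
# All three genes follow roughly the same fold-change pattern at different scales.
pattern <- c(1, 2, 1, 2, 4, 4)
norm_counts <- cbind(
  high = 1000 * pattern,
  mid  = 50 * pattern,
  low  = c(0, 2, 1, 2, 4, 4)   # contains a zero, so log2() alone would give -Inf
)
rownames(norm_counts) <- paste0("sample", 1:6)

# Share of the total variance contributed by each gene on the raw scale
raw_var   <- apply(norm_counts, 2, var)
raw_share <- raw_var / sum(raw_var)

# Same calculation after log2 with a pseudocount of 1 (assumed value)
log_counts <- log2(norm_counts + 1)
log_var    <- apply(log_counts, 2, var)
log_share  <- log_var / sum(log_var)

print(round(raw_share, 4))  # the highly expressed gene carries almost all variance
print(round(log_share, 4))  # the three genes contribute comparable amounts
```

Since PCA looks for directions of maximal variance, the sample coordinates computed from `norm_counts` and from `log_counts` will generally differ, even though only the samples are plotted.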

Thanks. I wasn't referring to log-transforming, but to "scaling" and "centering", which are standardization options when performing PCA. Since I am not visualizing the expression counts, but rather the samples and their coordinates in PCA space, I don't think it makes any difference if I log-transform the data in this case.

modified 4.2 years ago • written 4.2 years ago by gaelgarcia

I may not be able to explain this properly, but you will find a better explanation here:

https://www.researchgate.net/post/What_is_the_best_way_to_scale_parameters_before_running_a_Principal_Component_Analysis_PCA 

modified 4.2 years ago • written 4.2 years ago by EagleEye
Kevin Blighe (London, England) wrote, 8 months ago:

Those options of `prcomp` (centering and scaling) just help to 'iron out' the 'bumps' in the data. If you don't apply them, you may very well observe PC1 representing more variance in the dataset than it normally would. This is why we would typically perform PCA on logged counts in the first place, i.e., because the distribution of logged data is more 'smooth' than that of unlogged.

I have written more about PCA here:

Kevin
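
As a rough illustration of the point about PC1 in the answer above, the sketch below runs `prcomp` on a small made-up matrix of logged expression values, once with centering only and once with centering plus scaling to unit variance. The matrix, the gene names and the helper `var_explained` are hypothetical. `center` and `scale.` are the actual `prcomp` arguments, and by default `prcomp` centers but does not scale.

```r
# Hypothetical log2 expression values: 6 samples (rows) x 3 genes (columns).
# geneA varies far more than geneB and geneC.
log_expr <- cbind(
  geneA = c(10, 14, 10, 14, 10, 14),
  geneB = c(5.0, 5.1, 5.2, 5.0, 5.1, 5.2),
  geneC = c(2.0, 2.2, 2.1, 2.0, 2.2, 2.1)
)
rownames(log_expr) <- paste0("sample", 1:6)

# Hypothetical helper: proportion of total variance captured by each PC
var_explained <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

# Centering only (the prcomp defaults): the high-variance gene drives PC1
pca_centered <- prcomp(log_expr, center = TRUE, scale. = FALSE)

# Centering and scaling: every gene is given unit variance before PCA
pca_scaled <- prcomp(log_expr, center = TRUE, scale. = TRUE)

pc1_centered <- var_explained(pca_centered)[1]
pc1_scaled   <- var_explained(pca_scaled)[1]

print(round(c(centered_only = pc1_centered, centered_and_scaled = pc1_scaled), 3))
```

In this toy case PC1 explains nearly all the variance without scaling and a clearly smaller share with it. Whether scaling is appropriate for real data is a judgment call, since it gives low-variance genes the same weight as high-variance ones. Note also that `scale. = TRUE` fails on columns with zero variance, so constant genes would need to be filtered out first.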
