Question: In single-cell RNA-seq analysis with PCA, how do I choose the variables that contribute most to the components?
5.1 years ago by
zhenyisong130
China
zhenyisong130 wrote:

I read the paper from the Quake lab on using single-cell RNA-seq to find new cell lineage markers in lung development. Their method uses PCA (principal component analysis) to select genes for unsupervised hierarchical clustering (HC). They write: "Genes with highest loadings in the first four components were analysed by unsupervised hierarchical clustering as well as PCA". I think the loadings are conceptually equivalent to the eigenvectors. So, for this analysis, they would have generated an m×4 matrix (m = number of genes; the loading matrix?). My question is: how do we choose the genes with the highest loadings?

(1) select the genes that have the largest sum of weights (that is, sum each row to get an m×1 vector, then order the genes by it), or

(2) select the genes that have one of the largest weights in any of the four columns?

Is the answer (1) or (2)? Or have I misunderstood the concept of PCA?

I also found the post "Gene Lists Using Principal Component Analysis In Microarray Gene Expression", but I think it describes an n×1 loading matrix.

BTW, is there another way to infer new cell lineages or classify groups of cells? Is there an evaluation report on those methods? TIA

modified 4.8 years ago by Jean-Karim Heriche21k • written 5.1 years ago by zhenyisong130

I'm having the same questions and was wondering if you have made any progress on this?

written 4.8 years ago by gaelgarcia150

No. Someone suggested that the first way is OK (add the weights together and then order by the sum), but I could not find this explanation in a textbook. I wrote to the authors to ask for the source code but got no response. Anyway, if you find the answer, do let me know.

modified 4.8 years ago • written 4.8 years ago by zhenyisong130
4.8 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche21k wrote:

From the methods section of the paper:

... genes with the highest PC loadings (highest absolute correlation coefficient with one of the first three to four principal components) were identified.
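One way to read that sentence can be sketched in base R. This is only an illustration under stated assumptions, not the authors' code: it uses `prcomp` from base R rather than the package used in the paper, the expression matrix `expr` is simulated, and names such as `signal_genes` and `max_abs_cor` are made up for the example.

```r
# Simulated stand-in for a log-expression matrix: 200 cells (rows) x 30 genes (columns).
set.seed(1)
n_cells <- 200
n_genes <- 30
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# Assumption for the demo: six genes share one strong signal across cells.
signal <- rnorm(n_cells)
signal_genes <- paste0("gene", 1:6)
expr[, signal_genes] <- expr[, signal_genes] + 3 * signal

# PCA on centred, scaled genes.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Correlation of every gene with each of the first four components (30 x 4).
n_pc <- 4
gene_pc_cor <- cor(expr, pca$x[, 1:n_pc])

# Score each gene by its highest absolute correlation with any ONE component,
# then keep the ten best-scoring genes.
max_abs_cor <- apply(abs(gene_pc_cor), 1, max)
top_genes <- names(sort(max_abs_cor, decreasing = TRUE))[1:10]
```

Here each gene is ranked by its single strongest association, so the sign of the correlation does not matter and a gene only needs to stand out on one component.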

written 4.8 years ago by Jean-Karim Heriche21k

What I mean is: how were the highest PC loadings calculated? I want to know whether the weights on the first three or four components are added up per gene (row) and the genes ordered by that sum, or whether each gene's single largest loading among those four components is used for the ordering. As I understand it, the description in the Methods section can still be read either way, which is a bit confusing for a non-native English speaker. Thanks.

written 4.8 years ago by zhenyisong130

My interpretation of this is that they went with option 2 you mentioned. The code used is in supplementary data 2 of the paper and here is what I believe to be the relevant section from file Ranalysis_scRNAseq_E18_80cells_paper.txt:

PCA.allgenes = PCA(PCA.data.log2.single, ncp=4, graph=T)
#PCA(PCA.data.log2.single, axes=c(3, 4))
dimension.PCA.allgenes<-dimdesc(PCA.allgenes, axes=c(1,2,3,4))
dim4<-as.data.frame(dimension.PCA.allgenes[[4]])
dim3<-as.data.frame(dimension.PCA.allgenes[[3]])
dim2<-as.data.frame(dimension.PCA.allgenes[[2]])
dim1<-as.data.frame(dimension.PCA.allgenes[[1]])
//
genes.corr.dim<-unique(c(row.names(dim1[c(1:18),]),row.names(dim1[(nrow(dim1)-10):nrow(dim1),]),row.names(dim2[c(1:18),]),row.names(dim2[(nrow(dim2)-18):nrow(dim2),]),row.names(dim3[1:18,]),row.names(dim3[(nrow(dim3)-18):nrow(dim3),]),row.names(dim4[(nrow(dim4)-18):nrow(dim4),])))

PCA(PCA.data.log2.single[,c(genes.corr.dim)], axes=c(1, 2))

#Hierarchical clustering with genes identified by PCA to correlate strongly with principal components:
data.cluster.candidates<-cbind(data.cast.log2.single[,1:7],data.cast.log2.single[,c(genes.corr.dim)])

hc.candidates <- hclust(as.dist(1-abs(cor(data.cluster.candidates[,8:ncol(data.cluster.candidates)],method="spearman"))), method="ward")

However, the code is poorly documented, to say the least, and I find it hard to read, though maybe that's just me not being a strong R programmer. I suspect the job is done in the dimdesc function, but the script does not say which package provides any of the functions used. My guess is that it's all in the FactoMineR package.
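The long genes.corr.dim line seems to pool, for each component, the rows at the top and at the bottom of a per-component table. Assuming those tables are sorted by correlation (which is how I understand dimdesc output), that amounts to taking the most positively and most negatively correlated genes per component. A base R sketch of that idea follows; the helper `top_bottom_genes`, the simulated matrix `expr`, and the fixed `k = 5` are all hypothetical, and the quoted code uses somewhat different counts per component.

```r
# Hypothetical helper: for one component, return the k most positively and
# the k most negatively correlated genes.
top_bottom_genes <- function(cor_vec, k) {
  ranked <- names(sort(cor_vec, decreasing = TRUE))
  unique(c(head(ranked, k), tail(ranked, k)))
}

# Simulated stand-in data: 100 cells (rows) x 50 genes (columns).
set.seed(2)
expr <- matrix(rnorm(100 * 50), nrow = 100,
               dimnames = list(NULL, paste0("gene", 1:50)))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)
gene_pc_cor <- cor(expr, pca$x[, 1:4])  # genes x 4 matrix of correlations

# Apply the helper to each of the four components, then pool the gene names.
per_pc <- lapply(seq_len(ncol(gene_pc_cor)),
                 function(j) top_bottom_genes(gene_pc_cor[, j], k = 5))
selected <- unique(unlist(per_pc))
```

The pooled vector `selected` plays the role of genes.corr.dim: a gene is kept if it is extreme on at least one component, which matches option 2 rather than option 1.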

modified 4.8 years ago • written 4.8 years ago by Jean-Karim Heriche21k

I greatly appreciate your help. Thanks again. However, I am wondering whether this approach (option 2) is an empirical method or has a convincing justification (is there a reference?).

modified 4.8 years ago • written 4.8 years ago by zhenyisong130

The loadings can be viewed as the correlations between the genes and the components, so by selecting in this way you pick genes that are strongly associated (positively or negatively) with one component. That makes sense if you want to characterize genes specific to a disease associated with a given component. With option 1, you would end up with non-specific genes, because any gene with a strong association with more than one component would rank high.
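A toy example makes the difference concrete. The numbers below are invented for illustration, not taken from the paper, and option 1 is interpreted here as summing absolute values (with signed sums, positive and negative loadings could also cancel out).

```r
# Hypothetical gene-by-component correlations (rows: genes, columns: PCs).
loadings <- rbind(
  specific_gene = c(PC1 = 0.90, PC2 = 0.05, PC3 = 0.02, PC4 = 0.01),
  shared_gene   = c(PC1 = 0.50, PC2 = 0.50, PC3 = 0.50, PC4 = 0.40),
  weak_gene     = c(PC1 = 0.10, PC2 = -0.10, PC3 = 0.05, PC4 = 0.02)
)

# Option 1: sum of absolute loadings across the four components.
score_sum <- rowSums(abs(loadings))

# Option 2: largest absolute loading on any single component.
score_max <- apply(abs(loadings), 1, max)

best_by_sum <- names(which.max(score_sum))  # "shared_gene"
best_by_max <- names(which.max(score_max))  # "specific_gene"
```

Under option 1 the gene spread moderately over all four components wins (1.90 against 0.98), while option 2 picks the gene tied tightly to PC1.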

written 4.8 years ago by Jean-Karim Heriche21k

I remember that each principal component (PC) has its own weight. Should we multiply the loadings by each component's weight and compare across the four components before selecting the largest one?

written 4.8 years ago by zhenyisong130