I am interested in identifying marker genes of infection by a parasite. I have several biological samples corresponding to different infection stages, and I used qPCR to quantify the expression of genes that are thought to be differentially expressed. Now I would like to run a principal component analysis (PCA) to (1) cluster the samples based on gene expression and (2) identify which genes contribute the most to the clustering.
From the qPCR experiment, I have the data in three forms: Cq values, relative quantities, and expression normalized against reference genes.
I have read tutorials, but most of them focus on RNA-seq data analysis, not qPCR. Some of my questions remain unanswered, and I hope to get some explanations here.
Which form of the data is better to use for the PCA? Intuitively I would use normalized expression, because the variance between samples has already been taken into account. But I can see, for example in the HTqPCR package, that Cq values are plotted instead.
Should I apply a log transformation such as log(N+1)? My data have a logarithmic distribution. I read that normality is not an assumption of PCA, but that the closer the data are to a normal distribution, the better PCA performs.
Should I scale the data? The `prcomp` function in R (for example) offers this possibility. But I guess this depends on the type of data that is used. My feeling is that scaling is unnecessary provided I use normalized gene expression, for the reason I gave in my first question.
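To make the scaling question concrete, here is a minimal sketch of what I understand centering and unit-variance scaling to do to each gene before the PCA itself is computed (the equivalent of `scale. = TRUE` in `prcomp`). It is written in Python with only the standard library, purely for illustration. The gene names and values are invented and are not my data, and the helper functions `standardize` and `variance_share` are hypothetical names of my own. It does not run a PCA. It only shows how much of the total variance each gene carries with and without scaling.

```python
import statistics

# Hypothetical expression values (rows = samples, columns = genes).
# Gene names and numbers are invented for illustration only.
genes = ["geneA", "geneB"]
samples = [
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
    [4.0, 40.0],
]

def standardize(rows, scale=True):
    """Center each column; optionally divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # stdev uses the n-1 denominator; 1.0 leaves the column unscaled.
    sds = [statistics.stdev(c) if scale else 1.0 for c in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in rows
    ]

def variance_share(rows):
    """Fraction of the total variance carried by each column."""
    variances = [statistics.variance(c) for c in zip(*rows)]
    total = sum(variances)
    return [v / total for v in variances]

# Centered only: the gene with the larger spread dominates the total variance.
share_centered = variance_share(standardize(samples, scale=False))
# Centered and scaled: every gene contributes the same variance.
share_scaled = variance_share(standardize(samples, scale=True))

print(dict(zip(genes, share_centered)))
print(dict(zip(genes, share_scaled)))
```

In this toy case the unscaled data are dominated almost entirely by the hypothetical geneB, while after scaling both genes carry equal weight. This is why I suspect the choice changes which genes appear to drive the clustering.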
I tried several combinations and they did not yield the same output. I would like to find the proper, objective approach; I don't want to just pick whichever one best matches my expectations.