I'm training to become a bioinformatician in a lab where everyone else is a traditional wet lab scientist with little math/analysis/programming background. Recently, I did an RNA-Seq experiment on three tissue types - CNV deletion, CNV duplication, and control CNV - and I applied a PCA as an exploratory analysis to see whether the CNV affects the transcriptome strongly enough that the CNV-DEL, CNV-DUP, and CNV-CTRL samples cluster separately.
When he saw my PCA plot, a labmate asked what a principal component is. Here's how I tried to explain it: A principal component is like a line composed of many variables (in this case, the 171,000 transcripts in my data) that captures an element of variation in the data. Each principal component has a unique pattern of weights assigned to each variable, and these weights are chosen such that the PCs are uncorrelated with each other - that is, the PCs are orthogonal. BUT - the genes with the highest weights in a PC are not inherently biologically meaningful.
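To make that explanation concrete, here is a minimal sketch in Python with NumPy. It runs on a small simulated matrix (9 samples by 50 transcripts), not on my actual data, and the variable names are my own. It computes the PCs from an SVD of the centered matrix, which is one common way to do PCA, and checks the two properties I described: the weight vectors are orthogonal, and the sample scores on different PCs are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression matrix: 9 samples (rows) x 50 transcripts (columns).
# Purely illustrative numbers, not the real RNA-Seq counts.
n_samples, n_transcripts = 9, 50
X = rng.normal(size=(n_samples, n_transcripts))

# Center each transcript so the PCs describe variation around the mean.
Xc = X - X.mean(axis=0)

# SVD of the centered matrix: each row of Vt is one PC's weight vector.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt                       # (n_components, n_transcripts)
scores = Xc @ Vt.T                  # coordinates of each sample on each PC
explained = s**2 / np.sum(s**2)     # fraction of total variance per PC

# Weight vectors are orthonormal: loadings @ loadings.T is the identity.
gram = loadings @ loadings.T

# Scores on different PCs are uncorrelated: their covariance is diagonal.
cov = scores.T @ scores / (n_samples - 1)

print("orthonormal weights:", np.allclose(gram, np.eye(gram.shape[0])))
print("fraction of variance on PC1:", round(float(explained[0]), 3))
```

Nothing in this construction refers to biology or to sample labels. The weights come only from the variance structure of the matrix, which is the point I was trying to make to my labmate.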
My labmates insist that the principal components must contain biologically relevant information about relationships between genes. I think their idea is that I could list which genes have the highest weights for a given PC and then submit them to DAVID and then conclude "genes involved in metabolic pathways are responsible for the greatest variation in our data."
I think their logic is like this: In our data, we expect CNV status will produce most of the variation in expression data between samples. The first principal component should describe which genes are most involved in producing the most variation between samples. Therefore, the genes heavily weighted in the first principal component should be somehow related to the genes affected by CNV status.
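The step in that chain I'm least sure about is the assumption that PC1 tracks CNV status at all. Below is a rough simulation sketch in Python with NumPy, with everything in it assumed for illustration: 18 samples, 200 transcripts, a small condition effect on 10 transcripts, and a larger hypothetical nuisance factor (labeled `batch` here; I don't know that anything like it exists in my data) acting on 100 other transcripts. It checks which factor PC1 follows and where the top-weighted transcripts come from.

```python
import numpy as np

rng = np.random.default_rng(1)

n_per_group, n_transcripts = 6, 200
# Hypothetical labels: 0 = CNV-DEL, 1 = CNV-CTRL, 2 = CNV-DUP.
condition = np.repeat([0, 1, 2], n_per_group)
# Hypothetical nuisance factor, balanced across the three conditions.
batch = np.tile([0, 1], 3 * n_per_group // 2)

# Baseline noise for every transcript.
X = rng.normal(size=(3 * n_per_group, n_transcripts))
# Small condition effect on transcripts 0-9 (the "CNV-affected" genes).
X[:, :10] += 1.0 * (condition[:, None] - 1)
# Larger nuisance effect on transcripts 50-149.
X[:, 50:150] += 3.0 * batch[:, None]

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]

# How strongly does PC1 follow each factor?
corr_batch = abs(np.corrcoef(pc1_scores, batch)[0, 1])
corr_cond = abs(np.corrcoef(pc1_scores, condition)[0, 1])

# The ten transcripts with the largest absolute weight on PC1.
top = np.argsort(np.abs(Vt[0]))[::-1][:10]

print("|corr(PC1, batch)|     =", round(float(corr_batch), 2))
print("|corr(PC1, condition)| =", round(float(corr_cond), 2))
print("top-weighted transcripts:", sorted(top.tolist()))
```

In this setup the top-weighted transcripts on PC1 come from the nuisance block rather than from the condition-affected transcripts, so listing them would say nothing about CNV status. Whether my real data behaves like this is exactly what I can't tell from the weights alone.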
What's missing from, or wrong with, what I'm saying about principal components? How can I explain the appropriate interpretation of PCA to my lab? What alternative analyses could I apply if I want to make a claim about underlying patterns of gene expression that distinguish the three conditions?