I have groups of samples with copy-number variation (CNV) calls made from microarray data, and I am trying to determine whether specific pathways are enriched for CNVs in particular phenotypes. I've looked at "How To Test Whether Copy Number Aberrations Are Enriched In A Gene List" and other posts that describe pathway analysis from expression data. My data are currently formatted for import into PathVisio: a tab-delimited file with genes as rows and samples as columns, where each value is the log-transformed fold-change for that gene in that sample. If a gene was not overlapped by a CNV in a subject, I assumed normal expression.
I have a few normal controls run with each batch, and each batch is a different phenotype. I'm trying to figure out the best way to determine whether a pathway is enriched. Should I (1) compare pathway-X in sample1 to pathway-Y in sample1; (2) compare pathway-X in phenotype1 (with all samples for that pathway averaged? summed?) to pathway-X in phenotype2; or (3) do something similar to the link above: generate random groups of genes of the same size as pathway-X and compare pathway-X in sample1 to a randomly generated group of genes in sample1?
Statistics is not one of my strengths so any input is greatly appreciated.
If I understand your question correctly, the first thing you should do is decide what a pathway alteration means, and what you will do when two genes in the same pathway have conflicting events (for example, a homozygous deletion in one and an amplification in another). I say this because people define an alteration in a pathway in different ways. For expression data, I have seen people define a "pathway activity score" by simply averaging the expression values of all genes in the pathway for each sample. You can take a similar approach for CNV data, but be aware that copy number is not the same as gene expression, so the resulting score will be quite noisy. Another common approach is to convert the data into a binary matrix, using thresholds to call each copy-number alteration (CNA) event as altered vs. non-altered, and then to compare the frequency of altered samples in each sample group.
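To make the two definitions concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the gene names, sample names, log-ratio values, pathway membership, and the +/-0.3 thresholds are assumptions, not recommended settings. It computes the averaged pathway score per sample, and separately the fraction of samples per phenotype in which at least one pathway gene passes the threshold.

```python
from statistics import mean

# Hypothetical toy data: log2 copy-number ratio for each gene in each sample.
# A value of 0.0 means the gene was not overlapped by a CNV in that sample.
log_ratio = {
    "GENE_A": {"s1": 0.8, "s2": 0.0, "s3": -1.2, "s4": 0.0},
    "GENE_B": {"s1": 0.0, "s2": 0.6, "s3": 0.0, "s4": 0.0},
    "GENE_C": {"s1": 0.0, "s2": 0.0, "s3": 0.0, "s4": 0.9},
}
pathway_genes = ["GENE_A", "GENE_B"]  # hypothetical pathway membership
phenotype = {"s1": "pheno1", "s2": "pheno1", "s3": "pheno2", "s4": "pheno2"}

# Illustrative thresholds for calling a gain or a loss (assumed, not canonical).
GAIN, LOSS = 0.3, -0.3


def pathway_score(sample):
    """Averaging approach: mean log-ratio over the pathway genes."""
    return mean(log_ratio[gene][sample] for gene in pathway_genes)


def pathway_altered(sample):
    """Binary approach: altered if any pathway gene is a gain or a loss."""
    return any(
        log_ratio[gene][sample] >= GAIN or log_ratio[gene][sample] <= LOSS
        for gene in pathway_genes
    )


def altered_frequency(group):
    """Fraction of samples in a phenotype group with the pathway altered."""
    samples = [s for s, p in phenotype.items() if p == group]
    return sum(pathway_altered(s) for s in samples) / len(samples)


scores = {s: pathway_score(s) for s in phenotype}
frequencies = {g: altered_frequency(g) for g in sorted(set(phenotype.values()))}
print(scores)
print(frequencies)
```

Note that with the averaging approach a gain in one gene and a loss in another within the same sample pull the score toward zero, which is exactly the conflicting-events problem you need to decide on first. The binary approach sidesteps that, at the cost of ignoring the direction and size of each event.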
Alternatively, I think you can try unsupervised hierarchical clustering on your gene-level data (after removing the non-altered genes to reduce the visualization complexity) and see whether the clusters tend to capture your phenotype categories. If you want to apply this at the pathway level, you can collapse your data to pathways (group genes into pathways) and cluster on those instead. I would start with this exploratory investigation of the data and then decide how to select the features (either genes or pathways) that explain each of your phenotypes.
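As a rough sketch of that exploratory step, the snippet below uses NumPy and SciPy's `scipy.cluster.hierarchy` module. The matrix, sample labels, the 0.3 threshold, and the choice of average linkage with Euclidean distance are all assumptions made for illustration; with real data you would load your own gene-by-sample table and probably try several settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy matrix: rows are samples, columns are genes (log2 ratios).
samples = ["s1", "s2", "s3", "s4"]
phenotype = ["pheno1", "pheno1", "pheno2", "pheno2"]
X = np.array([
    [0.8, 0.7, 0.0, 0.0, 0.0],
    [0.9, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, -0.9, 0.0],
    [0.0, 0.0, -1.1, -0.8, 0.0],
])

# Drop genes that are not altered in any sample (illustrative threshold).
altered = (np.abs(X) >= 0.3).any(axis=0)
X = X[:, altered]

# Average-linkage hierarchical clustering of samples on Euclidean distance.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into two clusters; the cluster count is an assumption here.
clusters = fcluster(Z, t=2, criterion="maxclust")

# Cross-tabulate cluster membership against phenotype labels.
table = {}
for cluster_id, pheno in zip(clusters, phenotype):
    counts = table.setdefault(int(cluster_id), {})
    counts[pheno] = counts.get(pheno, 0) + 1
print(table)
```

If the clusters line up with the phenotype labels in the cross-tabulation, that suggests the CNV profiles carry phenotype signal worth testing more formally. To work at the pathway level, the columns of `X` would be pathway-level values (for example an averaged score or a binary altered flag per pathway) rather than genes.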