I'd like to cluster genes that are co-expressed from my dataset (6000 genes X 300 conditions)
I'd like a scaling method that does not amplify noisy measurements, but still brings genes with similar gene expression profiles (but different dynamic ranges) together. Here's an example, everything is in log2 expression. I calculate FC from a reference.
g = data.frame(row.names = c('gene_1_pathway_1', 'gene_2_house_keeper', 'gene_3_noisy', 'gene_4_pathway_1'), reference_condition = c( 5, 9, 0.01, 2), condition_1 = c( 12, 9.001, 0.08, 5), condition_2 = c( 3, 8.9, 0.00001, 1)) g #> reference_condition condition_1 condition_2 #> gene_1_pathway_1 5.00 12.000 3.00000 #> gene_2_house_keeper 9.00 9.001 8.90000 #> gene_3_noisy 0.01 0.080 0.00001 #> gene_4_pathway_1 2.00 5.000 1.00000 g_FC = g - g[,'reference_condition'] g_FC = g_FC[,-1] g_FC #> condition_1 condition_2 #> gene_1_pathway_1 7.000 -2.00000 #> gene_2_house_keeper 0.001 -0.10000 #> gene_3_noisy 0.070 -0.00999 #> gene_4_pathway_1 3.000 -1.00000
If we look at the distances between
gene4 they are pretty bad.
dist(g_FC) #> gene_1_pathway_1 gene_2_house_keeper gene_3_noisy #> gene_2_house_keeper 7.2523100 #> gene_3_noisy 7.2100652 0.1134143 #> gene_4_pathway_1 4.1231056 3.1311341 3.0927366
So we try normalize by the variability of the measurement, allowing genes with the same signature but difference dynamic ranges to be similar.
g_FC_norm = t(apply(g_FC,1,function(x) x /sd(x))) dist(g_FC_norm) #> gene_1_pathway_1 gene_2_house_keeper gene_3_noisy #> gene_2_house_keeper 1.53575358 #> gene_3_noisy 0.19466322 1.73041680 #> gene_4_pathway_1 0.05555556 1.48019802 0.25021878
This seemed to help for the case of the
house keeper gene, but not for the low
What strategy would you recommend using? Simply filtering out lowly expressed noisy genes first? different distance metric?