I'd like to cluster genes that are co-expressed from my dataset (6000 genes X 300 conditions)
I'd like a scaling method that does not amplify noisy measurements, but still brings genes with similar gene expression profiles (but different dynamic ranges) together. Here's an example, everything is in log2 expression. I calculate FC from a reference.
g = data.frame(row.names = c('gene_1_pathway_1', 'gene_2_house_keeper', 'gene_3_noisy', 'gene_4_pathway_1'),
reference_condition = c( 5, 9, 0.01, 2),
condition_1 = c( 12, 9.001, 0.08, 5),
condition_2 = c( 3, 8.9, 0.00001, 1))
g
#> reference_condition condition_1 condition_2
#> gene_1_pathway_1 5.00 12.000 3.00000
#> gene_2_house_keeper 9.00 9.001 8.90000
#> gene_3_noisy 0.01 0.080 0.00001
#> gene_4_pathway_1 2.00 5.000 1.00000
g_FC = g - g[,'reference_condition']
g_FC = g_FC[,-1]
g_FC
#> condition_1 condition_2
#> gene_1_pathway_1 7.000 -2.00000
#> gene_2_house_keeper 0.001 -0.10000
#> gene_3_noisy 0.070 -0.00999
#> gene_4_pathway_1 3.000 -1.00000
If we look at the distances between gene1
and gene4
they are pretty bad.
dist(g_FC)
#> gene_1_pathway_1 gene_2_house_keeper gene_3_noisy
#> gene_2_house_keeper 7.2523100
#> gene_3_noisy 7.2100652 0.1134143
#> gene_4_pathway_1 4.1231056 3.1311341 3.0927366
So we try normalize by the variability of the measurement, allowing genes with the same signature but difference dynamic ranges to be similar.
g_FC_norm = t(apply(g_FC,1,function(x) x /sd(x)))
dist(g_FC_norm)
#> gene_1_pathway_1 gene_2_house_keeper gene_3_noisy
#> gene_2_house_keeper 1.53575358
#> gene_3_noisy 0.19466322 1.73041680
#> gene_4_pathway_1 0.05555556 1.48019802 0.25021878
This seemed to help for the case of the house keeper gene
, but not for the low noisy gene
.
What strategy would you recommend using? Simply filtering out lowly expressed noisy genes first? different distance metric?
I guess my question might boil down to:
I know z-scoring is a good idea, but how to avoid amplifying the signal of noisy genes?
I'm not sure from where you produced / obtained this data, but generally one should filter out transcripts that have, say, <10 raw counts across a certain proportion of samples (or that has a large amount of NA or zero counts). That, coupled with normalisation, should usually allow the avoidance of problems later down the line.
I did also think that to convert your data to the Z-scale would help. I don't see how this would necessarily exacerbate or 'amplify' the issue of the noisy gene as you call it, though(?). Z-scaling usually improves such situations from my experience on data that can be a lot messier than RNA-seq (metabolomics data).
Edit: Just to be aware too, and I'm not sure if this is creating the concern with your data or not, but housekeepers just don't behave as we would like them too. They exhibit an uncomfortable level of variation across samples and, particularly, different tissue types. Different Euclidean distance values from each of your test genes to your housekeepers is something that I would fully expect from any RNA-seq experiment.