Question: How to "Randomly" Select Genes
1
gravatar for Ark
12 weeks ago by
Ark60
US
Ark60 wrote:

Hello again Biostars community!

I am working on a project for my internship regarding the efficacy of different clustering techniques on RNA-seq data. I have used many different common clustering techniques in R and compared their results and the clusters they produce. Now, I have gotten to a point in the project where I want to analyze and explicitly show how my clustering is performing compared to "random" clustering. (Where I redistribute the "cluster" label across different groups of genes.)

To this point, I have identified k-means clustering as the most helpful for my dataset and identified some positive control groups that I know are biologically meaningful and should cluster together. For these positive controls, I have calculated a p-value and this is where my "random" groups come in!

For each of my clusters, I want to redistribute my genes as randomly as possible (within the limits of reproducibility) and examine how my p-values change for the positive control groups. Ideally I would like to see the p-values for my clusters showing much higher significance than the random clusters, but who knows? (I mean... I'm pretty sure I know, but that's why I'm doing this!)

I would like each new random cluster to contain the same number of genes as the clusters that I have already generated. (e.g. Cluster_1 had 60 genes, so Random_Cluster_1 should also have 60 genes etc.). Of course, I also need to make sure any one gene is only assigned to a single cluster.

Would anyone be able to recommend a robust method of "randomly" reassigning my genes to new clusters so that I can check my clustering performance?

Any feedback would be appreciated as I don't have much experience in this area (pretty new to bioinformatics!)

Thank you!

clustering rna-seq R gene • 319 views
ADD COMMENTlink modified 7 weeks ago • written 12 weeks ago by Ark60
2

You might do a bootstrap test, sampling a subset of genes (control and treatment) and clustering them to see how error changes for given k. http://www.win-vector.com/blog/2016/02/finding-the-k-in-k-means-by-parametric-bootstrap/

ADD REPLYlink written 12 weeks ago by Alex Reynolds26k

I see, that's interesting. Earlier in my project I created an "elbow-graph" from the WSS values to attempt to gain some insight on the number of clusters I should use. It was not as helpful as I was hoping for, but I only tried it on my data, not with any "synthetic" data to compare it to. I haven't done anything like that before, but it does look very helpful. Thank you!

ADD REPLYlink written 12 weeks ago by Ark60
3
gravatar for Nicolas Rosewick
12 weeks ago by
Belgium, Brussels
Nicolas Rosewick7.0k wrote:

So you need to divide the genes (N=20000) into x clusters of different sizes. Each gene needs to be in one cluster.

Here some R solution (though not very optimal I guess - maybe there is already a function implemented in package that does that)

Example with N=20 genes and 4 clusters of size 2, 3, 5 and 10 genes.

assign_genes_randomly <- function(genes,cluster_sizes){
  random_assignment <- sample(unlist(sapply(cluster_sizes,function(x){rep(x,x)})))
  random_clusters <- sapply(cluster_sizes,function(x){genes[random_assignment==x]})
  return(random_clusters)
}

genes <- 1:20
cluster_sizes <- c(2,3,5,10)

set.seed(123)
random_clusters <- assign_genes_randomly(genes,cluster_sizes)

Results :

random_clusters
[[1]]
[1]  6 20

[[2]]
[1] 12 18 19

[[3]]
[1]  1  3  9 11 14

[[4]]
 [1]  2  4  5  7  8 10 13 15 16 17
ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by Nicolas Rosewick7.0k

This looks promising! Everything I had written up to this point used for loops instead of apply and I knew there must be a better/more efficient way. The combination of sample(rep()) is particularly interesting!

Thank you!

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Ark60
3
gravatar for zx8754
12 weeks ago by
zx87546.1k
London
zx87546.1k wrote:

We can shuffle genes using sample, then split:

# stealing inputs from @NicolasRosewick's answer:
genes <- 1:20
cluster_sizes <- c(2, 3, 5, 10)

# shuffle, then split
set.seed(1); random_clusters  <- split(sample(genes, length(genes)),
                                       rep(cluster_sizes, cluster_sizes))

random_clusters
# $`2`
# [1] 6 8
# 
# $`3`
# [1] 11 16  4
# 
# $`5`
# [1] 14 15  9 19  1
# 
# $`10`
# [1]  3  2 20 10  5  7 12 17 18 13
ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by zx87546.1k

This is nice also. Looks a bit cleaner. Thanks!

ADD REPLYlink written 10 weeks ago by Ark60
1
gravatar for Ark
7 weeks ago by
Ark60
US
Ark60 wrote:

I ran into a problem with both of these methods when you have many clusters, so I wanted to post my fix here just in case anyone finds this thread later.

If you have many clusters, it is likely that you will have clusters of the same size. The above methods will group together everything with the same size, even if they were originally from other clusters. As an example, if two clusters had a size of "3", then they will be essentially labeled "3", and when our sampled genes are re-assigned they will make one cluster of size 6 instead of two clusters of size 3. This is not what I wanted (and honestly, I don't believe my initial question was clear enough about this, so I apologize for the confusion).

To get around this, I used @zx8754 excellent solution, but tweaked it just slightly so that I replicate based on the number of clusters I have. Something like this:

random_clusters  <- split(sample(genes, length(genes)),
                                   rep(1:number_of_clusters, cluster_sizes))

This seems to have solved the problem.

Thank you to both @NicolasRosewick and @zx8754 as I would not have been able to solve this problem without their guidance!

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by Ark60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1615 users visited in the last hour