Question

Random gene selection based on gene metrics such as length, GC content, replication time, etc...

1

Entering edit mode

10.3 years ago

darbrob ▴ 10

Hello,

Thank you in advance for anyone who considers this question.

I'm doing some gene set enrichment analysis and have a list of 100 genes I have curated from the literature. This list of genes is very specific for genes involved in a type of chromatin remodeling. I have performed a type of gene set enrichment using this list and found significance enrichment for rare, coding variants in these genes within a disease cohort compared to a control cohort.

What I would like to do now is create several additional control gene sets to see if the enrichment is specific to the chromatin remodeling gene set. What I would like to do is create several gene sets of the same approximately size (about 100 genes) that have overall set metrics the same as my chromatin remodeling set.

For instance, in my chromatin remodeling gene set the average gene length is approximately 72kb. What I'd really like to do is create several additional control gene lists of about the same number (~100) that also have an average gene length of above 72kb. I would also like to extend this to other gene metrics as well such as GC content, replication time, and others.

Does anyone know of an existing program/script that would do this if provided with a gene list that included all the gene lengths, GC content, replication time, etc.. included in it?

Thanks again for thinking about this - it would be a big help.

gene genome sequence • 3.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by darbrob ▴ 10

Ram · Answer 1 · 2015-04-01

1

Entering edit mode

10.3 years ago

TriS ★ 4.8k

Here is an example of how you can do it in R given that you have the info that you need

# create some random names
name <- c()
for (i in 1:400){
  temp <- paste(sample(c(0:9, LETTERS), size = 3, replace = T), sample(c(0:9, LETTERS), size = 3, replace = T), sep="_")
  name <- c(name, temp)
}

# create some random size, GC content, replication time
size <- rnorm(1200,mean = 72,sd = 40)
gc <- rnorm(1200,mean = 50,sd = 30)
rep_time_sec <- rnorm(1200,mean = 20,sd = 10)

# save everythign in a data frame
all <- data.frame(name, size, gc, rep_time_sec)

# find out where the elements that you want are
ind_size <- which(all$size > 72) # size bigger than 72kb
ind_gc <- which(all$gc < 70) # GC lower than 70%
ind_rep_time <- which(all$rep_time_sec < 19) # replication time less than 29 sec 

# now find which gene names have ALL of the above attributes
all_filtered <- Reduce(intersect, list(all[ind_size,1], all[ind_gc,1], all[ind_rep_time,1]))

# and randomly select 3 sets of 100 genes each and save the names as set_1, set_2 and set_3...all set :)
for (i in 1:3){
  set_random <- sample(all_filtered,size = 100)
  assign(paste("set_",i,sep=""),set_random)
}

So now you have 3 sets of random genes with the attributes that you want. keep in mind that the approach above can lead to overlapping gene names in the different random sets

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by TriS ★ 4.8k

0

Entering edit mode

This is pretty close to what I'm looking for - thanks! However, as I understand this code, the lists that are being generated are simply genes that meet a static criteria such as gene size > 72, GC < 70, etc...

What I'm trying to end up with are gene sets where the average gene length for all the genes in the random set is approximately 72 and so on for GC content, replication time, etc...

Is there a variation on this code in R that can do this? Thanks so much for your help!

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by darbrob ▴ 10

0

Entering edit mode

You can do that too by creating the boundaries with the rnorm() function, in this way you give a range but you don't specify a hard threshold. i.e. for size it could be something like (peeking to the code above):

# create some random size, GC content, replication time
size <- rnorm(1200,mean = 72,sd = 40)
ind_size <- which(all$size > min(size) & all$size < max(size))
my_random_set_av_72 <- sample(all[ind_size,], size=100)

now you have a random gene set of 100 genes whose average-ish size is 72. you can play with the mean and sd values to get the boundaries that you need/want

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by TriS ★ 4.8k

Ram · Answer 2 · 2015-04-01

Probably something you just code up yourself.

The basis of it should be simple enough in something like python. Your only issue may be that the numbers e.g. 72kb might have to have softer boundaries e.g. 70-74kb - depending on your dataset.

Store genes in a dict with a list of properties as the value. Just create a permutation or properties that adds up to your desired properties and store the list of these genes. Put it into a while loop that ends when your have X control gene lists.

Do up the code best you can and come back if you need more help, but do some code first maybe?

R