Question: Script for Gene Clustering.
talalamin wrote, 22 months ago:

I have multiple genes and their coordinates. I want to calculate the distance between genes and cluster those genes whose distance is below a threshold value. I am trying to do it manually. Is there any program or script to do that other than bedtools? Thanks

distance merge cluster gene • 672 views
modified 22 months ago by Ben50 • written 22 months ago by talalamin0

Please elaborate on your question and on what kind of data you have, ideally with a snippet of the data. If you want to cluster your genes with unsupervised hierarchical clustering, you can calculate the pairwise distances and plot a dendrogram. You can do that in R; look at linkage methods such as complete linkage or ward.D2, and try to understand how the hclust function works if the intention is to cluster all of them. When you say you want to cluster only those genes below a specific threshold, you need to clarify on what basis you are calculating the threshold and why you want to do that. Your clustering will obviously change when you remove observations, but is that the correct way to do it? Be more descriptive and then we can help you more. Thanks

written 22 months ago by ivivek_ngs4.7k

It is unclear what you are asking. It seems, though, that you want to group genes based on their genomic location, as in a gene cluster? What is the purpose of this approach?

What is the distance of genes on different chromosomes? Would it be better to use genetic distance in cM?

written 22 months ago by Michael Dondrup45k
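
On the cross-chromosome question, one possible convention (an assumption of this sketch, not something settled in the thread) is to treat genes on different chromosomes as infinitely far apart, so they can never end up in the same cluster. The coordinates below are made up, and the `chr`, `start` and `end` column names and the `chrom_gap` helper are hypothetical:

```r
# Made-up coordinates; the chr/start/end column names are hypothetical.
genes <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100, 5000, 120),
  end   = c(900, 6000, 800),
  stringsAsFactors = FALSE
)

# Physical gap in bp between rows i and j of `genes`:
# Inf across chromosomes, 0 for overlapping genes.
chrom_gap <- function(genes, i, j) {
  if (genes$chr[i] != genes$chr[j]) return(Inf)
  max(0, max(genes$start[i], genes$start[j]) - min(genes$end[i], genes$end[j]))
}

same_chr <- chrom_gap(genes, 1, 2)  # both genes on chr1
diff_chr <- chrom_gap(genes, 1, 3)  # chr1 vs chr2
```

With this convention any finite threshold keeps chromosomes separate. Genetic distance in cM would need a recombination map, which this sketch does not attempt.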

Yes, this is exactly what I want. I want to define a syntenic region, and later apply the same approach to enhancers, so I can compare it with other genomic locations.

written 22 months ago by talalamin0

Thank you guys for your quick reply. Here is my problem in detail.

I have 50+ genes and their coordinates (location on the chromosome), like:

  1. 32889611-32974403
  2. 33134735-33219529
  3. 33304325-33389119

I want to find the distance (in terms of location on the chromosome) between these genes, and if the distance is less than 20000, I want to cluster those genes.

  • For example

I calculated the distance (difference) between Gene2 and Gene1 (33134735 - 32974403), which is 160332. I want to calculate this distance (difference) between all genes. In the next step I want to take only those genes (original coordinates) whose difference is less than 20000 and put them together. For example, if the distance between two or more genes is within range, place them like this: 32889611-32974403---33134735-33219529--33304325-33389119.

I am able to do it manually, but I have had no luck doing it automatically except with bedtools. Thanks for your help.

written 22 months ago by talalamin0
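
The procedure described in this post can be sketched in R as a sorted sweep: order the genes by start, measure the gap from each gene to the furthest end seen so far, and start a new cluster whenever that gap reaches the threshold. This is only a minimal sketch under some assumptions: all genes are on one chromosome, the helper name `cluster_by_gap` and the column names are hypothetical, and it is not claimed to reproduce bedtools output exactly.

```r
# Coordinates taken from the post above; the column names are illustrative.
genes <- data.frame(
  start = c(32889611, 33134735, 33304325),
  end   = c(32974403, 33219529, 33389119)
)

# Hypothetical helper: assign a cluster id to each gene on ONE chromosome.
# A gene joins the current cluster when its gap to the furthest end seen
# so far is below max_gap; overlapping genes have a gap <= 0 and always join.
cluster_by_gap <- function(genes, max_gap = 20000) {
  genes <- genes[order(genes$start, genes$end), ]
  running_end <- cummax(genes$end)                  # furthest end seen so far
  gap <- genes$start[-1] - running_end[-nrow(genes)]
  genes$gap_to_prev <- c(NA, gap)
  genes$cluster <- cumsum(c(TRUE, gap >= max_gap))  # TRUE starts a new cluster
  genes
}

res20k  <- cluster_by_gap(genes, max_gap = 20000)
res100k <- cluster_by_gap(genes, max_gap = 100000)

# Collapse each cluster to one span (minimum start, maximum end).
spans <- data.frame(
  start = tapply(res100k$start, res100k$cluster, min),
  end   = tapply(res100k$end,   res100k$cluster, max)
)
```

For the three genes shown, the gaps are 160332 and 84796, so a threshold of 20000 leaves every gene in its own cluster, while a threshold of 100000 groups the second and third genes.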

That would simply be the minimum difference of start/end - start/end; I think you can even do this in Excel. What kind of script would you need?

written 22 months ago by Michael Dondrup45k

Yes, I am already doing it manually in Excel, but I want to do it with Perl or R. I just want to re-confirm the results I generated with bedtools.

written 22 months ago by talalamin0

Maybe you want to cluster multiple genes by expression level, not physical distance.

written 22 months ago by Ben50

Gene clusters could be defined on physical or genetic distance, among other things. Whether gene clusters are relevant, or even exist, is open to discussion (see e.g. http://dev.biologists.org/content/134/14/2549 ). To answer this, one could start by determining whether genes are located within a certain distance of each other; however, it might be better to compare multiple organisms rather than look at a single one, and then ask whether a gene organization is conserved between different organisms.

modified 22 months ago • written 22 months ago by Michael Dondrup45k
Michael Dondrup (Bergen, Norway) wrote, 22 months ago:

This is possibly easiest to do in R, along these lines (untested):

# Read the coordinates into R; the call depends on your format,
# e.g. export the 2 columns (start, end) to CSV from Excel.
gene.coords = read.csv(...)
gene.coords = t(apply(gene.coords, 1, sort)) # make sure start < stop
# Minimum distance matrix over the start-start, end-end, start-end and end-start
# differences. Note: overlapping genes do not give 0 with this formula.
my.gene.dist = apply(gene.coords, 1, function(x) {
  apply(gene.coords, 1, function(y) min(abs(c(x - y, x[1] - y[2], x[2] - y[1]))))
})
modified 22 months ago • written 22 months ago by Michael Dondrup45k
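
A pairwise distance matrix like the one in this answer could then be turned into groups, for example with single-linkage hierarchical clustering cut at the threshold. This is a suggestion rather than part of the answer itself. The sketch below uses the three coordinate pairs from the question and a hypothetical `interval_gap` helper that returns 0 for overlapping genes, which differs from the min-abs-difference formula above:

```r
# The three coordinate pairs from the question (start, end).
coords <- rbind(
  c(32889611, 32974403),
  c(33134735, 33219529),
  c(33304325, 33389119)
)

# Hypothetical helper: gap between two intervals, 0 when they overlap.
interval_gap <- function(x, y) max(0, max(x[1], y[1]) - min(x[2], y[2]))

# Fill the symmetric pairwise gap matrix.
n <- nrow(coords)
gap <- matrix(0, n, n)
for (i in seq_len(n)) for (j in seq_len(n)) {
  gap[i, j] <- interval_gap(coords[i, ], coords[j, ])
}

# Single linkage joins two groups when their closest members are within
# the cut height, which matches "chain genes that sit close together".
tree <- hclust(as.dist(gap), method = "single")
clusters_20k  <- cutree(tree, h = 20000)
clusters_100k <- cutree(tree, h = 100000)
```

Note that `cutree` joins groups whose merge height is less than or equal to `h`, so a strict "less than 20000" rule would need a slightly smaller cut height. `plot(tree)` draws the corresponding dendrogram.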