Question: Clustering Gene Absence And Presence Data In Binary Format
6.4 years ago by
Canada
thakurshalabh60 wrote:

Hi, I have gene presence/absence data for approximately 60 genomes. I have created a matrix for each gene family, giving the value 1 if it is present and 0 if it is absent. I want to cluster this data both by strains that are more similar in the genes they share and by gene families that are shared across different strains. I know R can do hierarchical clustering, but I am looking for something more visual, such as a heat map or correlation plot.

My data look like this. Any idea what method would be best to represent this data?

GROUP    Pla302278PT    Pla3988    PmaH7608    Pma90_32    PtoDC3000    PmaM6    PmaM4a    Pto1108    PtoT1    PtoK40    PtoMax13    Pav631    Pmp302280PT    Pan302091    Ptt50252    Pja301072    Ppi1704B    Pav037    Pav013    Pac302273    PacA10853    PsyB728a    PssB48    PssA2    PsCit7    Psv4352    Psv3335    PmyAZ84488    PmpFTRS_U7    Pae3681    Pae0893_23    Pla301315    Pla107    PlaYM7902    Pta11528    Pta6606    PseHC_1    Pmo301020    PgyB076    PgyUnB647    PgyBR1    PgyKN44    PgyLN10    PphY5_2    PmaKN91    PphNPS3121    PphHB10Y    Pph1448A    PmeN6801    Pph1302A    PmaYM7930    PmaES4326    Pci0788_9    Por36_1
OrthoGroup7591.1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    3    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
OrthoGroup13947.1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    0    0
OrthoGroup6352.114    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    -1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
OrthoGroup3637.2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
clustering • 4.7k views
modified 6.4 years ago by Nari870 • written 6.4 years ago by thakurshalabh60

Please look carefully at your data and make sure what is shown above is the correct format (I was assuming your data is not in paragraph form). Also, just post a sample, not the whole data set.

modified 6.4 years ago • written 6.4 years ago by SES8.2k
6.4 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

The R heatmap function might be of use to you. Also, you might try this search:

http://www.biostars.org/search/?q=heatmap+R
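R's heatmap function reorders the matrix by hierarchical clustering before drawing it, so similar profiles end up next to each other. The idea can be sketched in plain Python as below. This is not the R function itself: it is a minimal illustration that assumes Hamming distance and average linkage, reorders strains only, and prints a text "heat map" instead of a plot. The strain names and profiles are made up.

```python
def hamming(a, b):
    """Number of gene families whose presence differs between two strains."""
    return sum(x != y for x, y in zip(a, b))


def cluster_order(profiles):
    """Average-linkage agglomerative clustering; returns strain names in merge order."""
    clusters = [[name] for name in sorted(profiles)]

    def avg_dist(c1, c2):
        total = sum(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        # Merge the two clusters with the smallest average distance.
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


def text_heatmap(profiles, order):
    """One line per strain: '#' for present, '.' for absent."""
    width = max(len(name) for name in order)
    return "\n".join(
        f"{name:<{width}} " + "".join("#" if v else "." for v in profiles[name])
        for name in order
    )


# Hypothetical strains (rows) and gene families (columns), already in 0/1 form.
profiles = {
    "S1": [1, 1, 0, 0],
    "S2": [0, 0, 1, 1],
    "S3": [1, 1, 1, 0],
    "S4": [0, 0, 0, 1],
}
order = cluster_order(profiles)
print(text_heatmap(profiles, order))
```

With around 60 genomes a real plotting library will be more readable; the point here is only that clustering decides the row order and the drawing step is separate.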

written 6.4 years ago by Sean Davis25k
6.4 years ago by
India
aravind ramesh500 wrote:

You can solve this problem quite easily.

I have explained it only briefly here, but it should be useful for your analysis. You will find plenty of material on the various methods and algorithms in R, so there is no need to code them yourself.

Step-1: Construct a formula for calculating the distance between the genomes of the various strains from the presence/absence data. Calculate the genome distances pairwise.

Step-2: Subject the distance data to PCA.

Step-3: From the PCA output, find the minimum number of dimensions that explains at least 80% of the variability (maybe the first 15-20 dimensions).

Step-4: Use any unsupervised clustering method, such as k-means, to cluster them (the number of clusters has to be preset; you can tune the parameters accordingly).

Step-5: Take the cluster centers and construct a phylogenetic tree.

Now you will have the relationships among the different genomes (belonging to different clusters). Hope this helps.
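Step-1 leaves the distance formula open. One common choice for presence/absence data is the Jaccard distance, which ignores gene families absent from both strains. The sketch below covers Step-1 only and assumes that choice; the answer does not name a specific formula, and the strain names and profiles are invented for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - (families present in both) / (families present in at least one)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two strains with no genes at all are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def pairwise_distances(profiles):
    """profiles maps strain name -> list of 0/1 values, one per gene family."""
    return {
        (s, t): jaccard_distance(profiles[s], profiles[t])
        for s, t in combinations(sorted(profiles), 2)
    }


# Hypothetical profiles over four gene families.
profiles = {
    "StrainA": [1, 1, 0, 0],
    "StrainB": [1, 1, 1, 0],
    "StrainC": [0, 0, 1, 1],
}
dist = pairwise_distances(profiles)
for pair, d in dist.items():
    print(pair, round(d, 3))
```

The resulting pairwise distances are what Step-2 onwards would take as input.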

written 6.4 years ago by aravind ramesh500

Thank you very much. I will surely try this method.

written 6.4 years ago by thakurshalabh60

All the best. I am curious to see your results. If possible post the final tree here.

written 6.4 years ago by aravind ramesh500
6.4 years ago by
Nari870
United States
Nari870 wrote:

1. If you are on Windows, download Past.exe.
2. Import the data file in tab-delimited text format.
3. Go to the 'Multivar' menu.
4. Choose 'Cluster analysis'.
5. Save the cluster in Nexus format or copy the graphic to an image.

(Your data seems to contain values such as 3 and -1. Such numbers are confusing; the data should contain only 1s and 0s to get correct clusters.)
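Converting the table to strict 0/1 form can be done with a few lines of code before loading it into any clustering tool. The sketch below assumes that every non-zero value (including 3 and -1) means "present"; that is a guess, since the thread never says what those values encode, so check it against how the matrix was built. The table in the example is made up.

```python
def parse_table(text):
    """Parse a whitespace-delimited table: header row of strains, then one row per gene family."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    strains = rows[0][1:]  # first header cell is the row-label column
    counts = {row[0]: [int(v) for v in row[1:]] for row in rows[1:]}
    return strains, counts


def binarize(counts):
    """Assumption: any non-zero value means the gene family is present."""
    return {family: [1 if v != 0 else 0 for v in values] for family, values in counts.items()}


# Hypothetical sample in the same layout as the question's data.
sample = """GROUP StrainA StrainB StrainC
Fam1 0 3 3
Fam2 -1 0 0
Fam3 0 0 0"""

strains, counts = parse_table(sample)
binary = binarize(counts)
for family, values in binary.items():
    print(family, values)
```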

Enjoy.

modified 6.4 years ago • written 6.4 years ago by Nari870

Dear @thakurshalabh, I guess either you got the answer or you don't need it anymore.

written 6.2 years ago by Nari870