Clustering Gene Absence And Presence Data In Binary Format
11.1 years ago

Hi, I have gene presence/absence data for approximately 60 genomes. I have built a matrix with one row per gene family, giving a value of 1 if the family is present in a strain and 0 if it is absent. I want to cluster this data in two ways: by strains, so that strains sharing more genes group together, and by gene families, so that families shared by the same strains group together. I know R can do hierarchical clustering, but I am looking for something more visual, such as a heat map or correlation plot.

My data look like this. Any idea what method would be best to represent this data?

GROUP    Pla302278PT    Pla3988    PmaH7608    Pma90_32    PtoDC3000    PmaM6    PmaM4a    Pto1108    PtoT1    PtoK40    PtoMax13    Pav631    Pmp302280PT    Pan302091    Ptt50252    Pja301072    Ppi1704B    Pav037    Pav013    Pac302273    PacA10853    PsyB728a    PssB48    PssA2    PsCit7    Psv4352    Psv3335    PmyAZ84488    PmpFTRS_U7    Pae3681    Pae0893_23    Pla301315    Pla107    PlaYM7902    Pta11528    Pta6606    PseHC_1    Pmo301020    PgyB076    PgyUnB647    PgyBR1    PgyKN44    PgyLN10    PphY5_2    PmaKN91    PphNPS3121    PphHB10Y    Pph1448A    PmeN6801    Pph1302A    PmaYM7930    PmaES4326    Pci0788_9    Por36_1
OrthoGroup7591.1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    3    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
OrthoGroup13947.1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    0    0
OrthoGroup6352.114    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    -1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
OrthoGroup3637.2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Please look carefully at your data and make sure what is shown above is the correct format (I was assuming your data is not in paragraph form). Also, just post a sample, not the whole data set.

11.1 years ago

You can solve this problem quite easily.

I have explained the approach only briefly here, but it should be useful for your analysis. You will find plenty of material on the various methods and algorithms in R, so there is no need to code them yourself.

Step-1: Choose a formula for the distance between two genomes based on the presence or absence of each gene family, and calculate the distances pairwise for all strains.

Step-2: Subject the distance data to PCA (applied to a distance matrix, this is usually called principal coordinates analysis, PCoA, or classical multidimensional scaling).

Step-3: From the PCA output, find the minimum number of dimensions that explains at least 80% of the variability (maybe the first 15-20 dimensions).

Step-4: Use any unsupervised clustering method, such as k-means, to cluster the strains in the reduced space. (The number of clusters has to be set in advance; you can tune it as needed.)

Step-5: Take the cluster centers and construct a phylogenetic tree from them.

Now you will have the relationships among the different genomes (belonging to different clusters). Hope this helps.
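
As a rough illustration of Step-1 and of grouping strains from the resulting distances, here is a minimal Python sketch using only the standard library. Several things in it are assumptions and not part of the answer above: the Jaccard distance is just one possible choice of "formula", the strain names and profiles are made up, and plain average-linkage agglomerative clustering stands in for the PCA plus k-means steps to keep the example dependency-free.

```python
from itertools import combinations

# Toy presence/absence profiles (1 = gene family present, 0 = absent).
# Strain names and values are hypothetical, for illustration only.
profiles = {
    "strainA": [1, 1, 0, 0, 1, 0],
    "strainB": [1, 1, 0, 0, 0, 0],
    "strainC": [0, 0, 1, 1, 0, 1],
    "strainD": [0, 0, 1, 1, 1, 1],
}

def jaccard_distance(a, b):
    """1 - (families present in both) / (families present in either)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - shared / either

def pairwise_distances(profiles):
    """Distance for every unordered pair of strains, keyed by frozenset."""
    return {
        frozenset((s, t)): jaccard_distance(profiles[s], profiles[t])
        for s, t in combinations(profiles, 2)
    }

def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    dist = pairwise_distances(profiles)

    def linkage(pair):
        # Mean distance over all cross-cluster strain pairs.
        c1, c2 = pair
        total = sum(dist[frozenset((s, t))] for s in c1 for t in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return [sorted(c) for c in clusters]

distances = pairwise_distances(profiles)
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

The same distance function can be applied to the transposed matrix to group gene families instead of strains.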


Thank you very much. I will surely try this method.


All the best. I am curious to see your results. If possible post the final tree here.

11.1 years ago

The R heatmap function might be of use to you. Also, you might try this search:

http://www.biostars.org/search/?q=heatmap+R
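
To show the idea behind a presence/absence heat map without any plotting library, here is a small Python sketch that prints a text grid. It is not the R heatmap function and does not build a dendrogram; it only reorders rows and columns in a simple way so that similar profiles sit next to each other. The function name `render_text_heatmap`, the strain names and the matrix values are all hypothetical.

```python
# Toy matrix: rows are gene families, columns are strains (1 = present).
# All names and values are made up for illustration.
strains = ["s1", "s2", "s3", "s4"]
matrix = {
    "famA": [1, 0, 1, 0],
    "famB": [1, 1, 1, 1],
    "famC": [0, 1, 0, 1],
    "famD": [1, 0, 1, 1],
}

def render_text_heatmap(strains, matrix):
    """Return text lines marking presence ('#') and absence ('.')."""
    # Rows: most widely shared gene families first (ties broken by name).
    row_order = sorted(matrix, key=lambda fam: (-sum(matrix[fam]), fam))

    # Columns: sort strains by their profile over the ordered rows, so
    # strains with identical or similar profiles end up side by side.
    def profile(col):
        return tuple(matrix[fam][col] for fam in row_order)

    col_order = sorted(
        range(len(strains)),
        key=lambda c: (profile(c), strains[c]),
        reverse=True,
    )

    label_width = max(len(fam) for fam in matrix)
    cell_width = max(len(s) for s in strains)
    header = " ".join(strains[c].ljust(cell_width) for c in col_order)
    lines = [(" " * label_width + " " + header).rstrip()]
    for fam in row_order:
        cells = " ".join(
            ("#" if matrix[fam][c] else ".").ljust(cell_width) for c in col_order
        )
        lines.append((fam.ljust(label_width) + " " + cells).rstrip())
    return lines

lines = render_text_heatmap(strains, matrix)
print("\n".join(lines))
```

A real heat map function would normally order rows and columns by hierarchical clustering and draw colored cells; the ordering used here is only a crude stand-in.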

11.1 years ago
Naren ▴ 1000

1. If you are on Windows, download Past.exe.
2. Import the data file in tab-delimited text format.
3. Go to the 'Multivar' menu.
4. Choose 'Cluster analysis'.
5. Save the cluster in Nexus format or copy the graphic to an image.

(Your data seems to contain values such as 3 and -1. Such numbers are confusing; the matrix should contain only 1s and 0s for the clusters to be correct.)
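
One way to get a strict 0/1 matrix from a tab-delimited table is sketched below in Python. The rule used is an assumption, since the thread does not say what 3 or -1 mean: any value greater than 0 is treated as present, and everything else (including -1) as absent. The function name `binarize` and the sample rows are made up.

```python
import csv
import io

# Small made-up sample in the same layout as the table in the question:
# a header row with strain names, then one row per gene family.
raw = (
    "GROUP\tstrain1\tstrain2\tstrain3\n"
    "OrthoGroupX\t0\t3\t3\n"
    "OrthoGroupY\t-1\t0\t1\n"
)

def binarize(text):
    """Parse tab-delimited text and map each value to 1 (present) or 0."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        group, values = record[0], record[1:]
        # Assumption: positive counts mean "present"; 0 and -1 mean "absent".
        rows[group] = [1 if int(v) > 0 else 0 for v in values]
    return header[1:], rows

strain_names, binary = binarize(raw)
print(strain_names)
print(binary)
```

If -1 actually encodes something else in the original data (for example a missing value), the rule inside the list comprehension is the place to change.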

Enjoy.


Dear @thakurshalabh, I guess either you got the answer or you don't need it anymore.
