Question

How should I analyze differences between phyletic patterns?

12 weeks ago

Igor • 0

I found orthogroups (with OrthoFinder) in the full archaeal proteomes of the genus Halorubrum. The result is a dataframe with the number of proteins in each orthogroup for each organism (orthogroups in rows, species in columns), in which I replaced every value greater than 1 with 1 to obtain phyletic patterns. In the end I have this dataframe: Phyletic patterns of genus Halorubrum - df 'ogroups_patterns'

There are thermophilic (['aethiopicum', 'coriense', 'tebenquichense', 'vacuolatum', 'lipolyticum', 'saccharovorum', 'terrestre', 'salsamenti', 'yunnanense', 'sodomense', 'distributum', 'aidingense', 'arcis']) and non-thermophilic organisms. The question is: how should I analyze this data if my goal is to find how the phyletic patterns of thermophiles differ from those of non-thermophiles?

I tried to calculate the Jaccard index for every orthogroup:

ogroups_patterns['J'] = ogroups_patterns_terms.sum(axis = 1, numeric_only = True) / ogroups_patterns.sum(axis = 1, numeric_only = True)

where ogroups_patterns_terms is a df with phyletic patterns as in the screenshot above, but for thermophiles only.

But I don't know whether this is the correct way to calculate the index in this case. Maybe allowing zeros in the formula would be a good idea, but I'm not sure how to code it. Any tip would be extremely helpful; I'm really stuck at this part and have no idea what to do or how to code it. Many thanks in advance!
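For reference, the formula above gives the fraction of organisms carrying an orthogroup that are thermophiles, which is not quite a Jaccard index. A minimal sketch of a set-based Jaccard index per orthogroup follows. It assumes the index is meant to compare the set of species that have the orthogroup with the set of thermophiles. The toy matrix, the species_x/y/z names and the shortened thermophile list are made up for illustration. Because the union always contains every thermophile, the denominator is never zero, so all-zero rows need no special handling.

```python
import pandas as pd

# Toy presence/absence matrix (made-up values): orthogroups in rows, species in columns.
ogroups_patterns = pd.DataFrame(
    {
        "aethiopicum": [1, 1, 0, 1, 0, 1],
        "coriense":    [1, 1, 0, 1, 0, 0],
        "arcis":       [1, 1, 0, 1, 1, 1],
        "species_x":   [1, 0, 1, 0, 1, 0],  # hypothetical non-thermophile
        "species_y":   [1, 0, 1, 0, 0, 0],  # hypothetical non-thermophile
        "species_z":   [1, 0, 1, 0, 1, 1],  # hypothetical non-thermophile
    },
    index=[f"OG{i:07d}" for i in range(6)],
)
thermophiles = ["aethiopicum", "coriense", "arcis"]  # shortened list for the toy example

# Boolean mask over the columns: True where the species is a thermophile.
is_thermo = ogroups_patterns.columns.isin(thermophiles)

# Intersection: thermophiles that carry the orthogroup.
present_thermo = ogroups_patterns.loc[:, is_thermo].sum(axis=1)
# Non-thermophiles that carry the orthogroup.
present_other = ogroups_patterns.loc[:, ~is_thermo].sum(axis=1)

# Union: all thermophiles plus the non-thermophiles that carry the orthogroup.
union = is_thermo.sum() + present_other

jaccard = present_thermo / union
print(jaccard.sort_values(ascending=False))
```

Values near 1 mark orthogroups present in (almost) all thermophiles and absent elsewhere; values near 0 mark orthogroups that thermophiles mostly lack.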

phylogeny thermophiles proteoms phyletic-patterns • 393 views

updated 12 weeks ago by GenoMax 142k • written 12 weeks ago by Igor • 0

score 2 · Accepted Answer · 2024-02-08

If you want to cluster by organisms, I suggest you transpose the matrix so that the organisms are in rows and the genes (orthogroups) are in columns. Then you can apply any of the dimensionality reduction methods (PCA, t-SNE, UMAP) to reduce the dataset to 2 or 3 dimensions. If your initial hypothesis is correct, thermophiles and non-thermophiles will fall into separate groups.
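The transposition and projection could be sketched roughly as follows, here with PCA from scikit-learn. The toy matrix, the species_x/y/z names and the shortened thermophile list are made up for illustration; in practice the real 'ogroups_patterns' dataframe and the full list would be used. Dropping constant columns is an extra step assumed here, since orthogroups present in every species (or in none) carry no information for separating organisms.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy presence/absence matrix (made-up values): orthogroups in rows, species in columns.
ogroups_patterns = pd.DataFrame(
    {
        "aethiopicum": [1, 1, 0, 1, 0, 1],
        "coriense":    [1, 1, 0, 1, 0, 0],
        "arcis":       [1, 1, 0, 1, 1, 1],
        "species_x":   [1, 0, 1, 0, 1, 0],  # hypothetical non-thermophile
        "species_y":   [1, 0, 1, 0, 0, 0],  # hypothetical non-thermophile
        "species_z":   [1, 0, 1, 0, 1, 1],  # hypothetical non-thermophile
    },
    index=[f"OG{i:07d}" for i in range(6)],
)
thermophiles = ["aethiopicum", "coriense", "arcis"]  # shortened list for the toy example

# Transpose so that species are rows (samples) and orthogroups are columns (features).
by_species = ogroups_patterns.T

# Drop orthogroups that have the same value in every species.
by_species = by_species.loc[:, by_species.nunique() > 1]

# Project the species onto the first two principal components.
coords = PCA(n_components=2).fit_transform(by_species)
pca_df = pd.DataFrame(coords, index=by_species.index, columns=["PC1", "PC2"])
pca_df["thermophile"] = pca_df.index.isin(thermophiles)
print(pca_df)
```

Plotting PC1 against PC2 and coloring the points by the 'thermophile' column shows whether the two groups separate.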

If you want to cluster by genes rather than by organisms, you don't need to transpose the matrix. In that case you are likely to get many more than two clusters.

Generally speaking, any clustering method can work with the data you have, although you may need to sparsify it by converting zeros to missing values. There are many clustering techniques available in Python, in the scikit-learn package specifically.

https://scikit-learn.org/stable/modules/clustering.html
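As one possible illustration with scikit-learn, the sketch below clusters species with KMeans and cross-tabulates the cluster labels against the thermophile annotation. KMeans is only one choice among the methods listed at the link above, and the choice of two clusters is an assumption taken from the thermophile/non-thermophile hypothesis. The toy matrix and the species_x/y/z names are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy presence/absence matrix (made-up values): orthogroups in rows, species in columns.
ogroups_patterns = pd.DataFrame(
    {
        "aethiopicum": [1, 1, 0, 1, 0, 1],
        "coriense":    [1, 1, 0, 1, 0, 0],
        "arcis":       [1, 1, 0, 1, 1, 1],
        "species_x":   [1, 0, 1, 0, 1, 0],  # hypothetical non-thermophile
        "species_y":   [1, 0, 1, 0, 0, 0],  # hypothetical non-thermophile
        "species_z":   [1, 0, 1, 0, 1, 1],  # hypothetical non-thermophile
    },
    index=[f"OG{i:07d}" for i in range(6)],
)
thermophiles = ["aethiopicum", "coriense", "arcis"]  # shortened list for the toy example

# Species as rows, orthogroups as columns (to cluster orthogroups instead, skip the transpose).
by_species = ogroups_patterns.T

# Two clusters, matching the thermophile / non-thermophile hypothesis (an assumption).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(by_species)

# Compare the unsupervised clusters with the known annotation.
is_thermo = pd.Series(by_species.index.isin(thermophiles), index=by_species.index, name="thermophile")
clusters = pd.Series(labels, index=by_species.index, name="cluster")
table = pd.crosstab(is_thermo, clusters)
print(table)
```

If the hypothesis holds, each row of the cross-table should fall mostly into a single cluster.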

Food for thought: