Question

Clustering similar genes from many genomes


3.5 years ago

anabaena ▴ 10

Hello all, I am working on elucidating a biochemical pathway. I want to take all genomes that carry this metabolic pathway and find which genes those genomes share. The pathway appears to be transferred horizontally in prokaryotes, and I want to identify the possible 'prerequisite genes' that let it work in new hosts, such as cofactor synthesis genes, transport proteins, etc., which may lie outside of the island. Has anyone done anything similar and figured out a good approach?

My initial thought was to simply examine feature tables and create a Venn diagram of the genomes sharing similar features, but many of the genes in the island itself are poorly annotated, so I need to do some form of clustering based on sequence identity/similarity.

Python genome pangenomics clustering • 850 views

updated 3.4 years ago by Joe 21k • written 3.5 years ago by anabaena ▴ 10

You can use cd-hit to cluster based on similarity of sequences.

If you have several gene clusters in GenBank (gbk) format and want to compare them, you can use clinker.
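For the cd-hit route, a minimal Python sketch of the follow-up step might look like this: read the `.clstr` output and keep the clusters that contain at least one protein from every genome of interest. It assumes all proteomes were concatenated into one FASTA before clustering and that each sequence ID was given a hypothetical `genome|gene` prefix; the function names and the separator are illustrative and not part of cd-hit. By default cd-hit may truncate long names in the `.clstr` file, so short IDs are safer.

```python
import re
from collections import defaultdict

# Captures the sequence name between '>' and '...' on a cluster member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Return {cluster_name: [sequence_id, ...]} from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def shared_clusters(clusters, genomes, sep="|"):
    """Names of clusters with at least one member from every listed genome.

    Assumes IDs look like '<genome><sep><gene>' (a hypothetical convention).
    """
    wanted = set(genomes)
    shared = []
    for name, members in clusters.items():
        present = {member.split(sep, 1)[0] for member in members}
        if wanted <= present:
            shared.append(name)
    return shared


# Toy .clstr content with made-up genome and gene names.
example = """>Cluster 0
0\t300aa, >gA|g1... *
1\t298aa, >gB|g7... at 95.00%
2\t301aa, >gC|g3... at 92.10%
>Cluster 1
0\t150aa, >gA|g2... *
1\t150aa, >gB|g9... at 99.00%
"""

print(shared_clusters(parse_clstr(example), ["gA", "gB", "gC"]))
```

Clusters that pass this filter for every pathway-carrying genome are candidates for shared 'prerequisite' genes; comparing them against genomes that lack the pathway would help separate them from ordinary core genes.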

3.4 years ago by Fatima ▴ 1000

3.4 years ago

Joe 21k

Depending on how closely related the genomes are, you could use a pangenome approach such as Roary. If the genomes are quite diverse, you can use other pangenome tools; some are designed for broader comparisons, though none of the names spring to mind at present.
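If you go the Roary route, its gene presence/absence table can be filtered for gene families found in every pathway-carrying genome. The sketch below assumes the table is a CSV with a `Gene` column plus one column per genome, where an empty cell means the gene is absent; the genome column names and the helper name are hypothetical, so check them against your own output file.

```python
import csv
import io


def genes_in_all(csv_text, genomes):
    """Return gene family names that have a non-empty entry for every genome.

    `genomes` are column names from the presence/absence table (assumed layout).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    shared = []
    for row in reader:
        if all((row.get(genome) or "").strip() for genome in genomes):
            shared.append(row["Gene"])
    return shared


# Toy table with made-up genomes and locus tags.
table = '''"Gene","Annotation","gA","gB","gC"
"group_1","hypothetical protein","gA_001","gB_104","gC_233"
"group_2","transporter","gA_002","","gC_090"
"group_3","cofactor synthesis","gA_003","gB_007","gC_411"
'''

print(genes_in_all(table, ["gA", "gB", "gC"]))
```

With a real file, read its contents with `open(path, newline="").read()` and pass the columns for the genomes that carry the pathway.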


score 2 · Accepted Answer · 2020-11-11

One option is to go to the STRING database and search for one of the proteins in that pathway. That will initially give you the interaction partners of your query, which may be enough for your purposes. You can also choose the neighborhood option, which shows conserved clusters of genes in various organisms and may tell you how conserved that protein and its neighbors are across multiple genomes. You may need to repeat the search with different query proteins to get a complete picture.
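Since repeating the search for several query proteins gets tedious in the browser, here is a small sketch that builds request URLs for STRING's REST interface instead. It only constructs the URL and does not send a request. The endpoint layout and parameter names follow my understanding of the STRING API and should be checked against its current documentation; the protein name and taxon ID are arbitrary illustrations.

```python
from urllib.parse import urlencode

# Base address of the STRING REST API (verify against the current docs).
STRING_API = "https://string-db.org/api"


def interaction_partners_url(protein, species_taxid, limit=20, fmt="tsv"):
    """Build a URL asking STRING for the interaction partners of one protein."""
    params = {
        "identifiers": protein,    # query protein name or STRING identifier
        "species": species_taxid,  # NCBI taxonomy ID of the organism
        "limit": limit,            # maximum number of partners to return
    }
    return f"{STRING_API}/{fmt}/interaction_partners?{urlencode(params)}"


# Example query: trpA in E. coli K-12 MG1655 (taxon 511145), illustration only.
print(interaction_partners_url("trpA", 511145, limit=5))
```

Looping this over each protein in the pathway and merging the returned partner lists would approximate the "start multiple times" advice above.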