Question

Grouping secondary metabolite gene clusters into families

7.2 years ago

kayrouz.1 • 0

I have a set of ~1000 bacterial secondary metabolite gene clusters that all encode biosynthesis of the same class of compound, based on a couple of marker genes common to all clusters. However, they also have variable regions that encode a wide variety of modifying enzymes. I have nucleotide FASTA files and GenBank files for each cluster, and each cluster is about 40 kb long. Does anybody know of a good way to group these gene clusters into gene cluster families (GCFs)?

So far I've been able to get decent results by:

1. Grouping based on similarity of the marker gene sequences and using MAUVE to visualize conservation of gene context
2. Clustering the context genes into orthologous groups and finding which organisms have orthologs in common
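
To make the second approach concrete, here is a minimal sketch of grouping clusters by shared orthologous groups. It assumes each gene cluster has already been reduced to a set of orthologous-group IDs; the names `og_content`, `group_by_shared_orthologs`, the toy IDs, and the 0.5 Jaccard threshold are all hypothetical choices for illustration, not part of any existing tool.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: shared items / all items (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_shared_orthologs(og_content, threshold=0.5):
    """Single-linkage grouping of gene clusters.

    og_content maps a cluster name to its set of orthologous-group IDs.
    Two clusters are linked when their Jaccard index is >= threshold,
    and links are followed transitively (union-find).
    """
    parent = {name: name for name in og_content}

    def find(x):
        # Walk up to the representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(og_content, 2):
        if jaccard(og_content[a], og_content[b]) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in og_content:
        families.setdefault(find(name), set()).add(name)
    return sorted(families.values(), key=sorted)


# Toy, made-up data: three clusters and the orthologous groups they contain.
example = {
    "bgc1": {"OG1", "OG2", "OG3"},
    "bgc2": {"OG1", "OG2", "OG4"},
    "bgc3": {"OG7", "OG8"},
}
families = group_by_shared_orthologs(example, threshold=0.5)
print(families)
```

Single linkage can chain loosely related clusters together, so the threshold is one of the things that would still need manual tuning.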

My next idea is to compute a pairwise tblastx distance matrix among all 1000 gene clusters and group based on distance scores using a clustering algorithm (e.g. CLANS).
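
A rough sketch of what I have in mind, not a tested pipeline. It assumes the tblastx hits have already been parsed into `(query, subject, bitscore)` tuples (columns 1, 2 and 12 of BLAST tabular output), that self-hits are included, and that summing bit scores over HSPs and normalizing by the smaller self-score is a reasonable similarity measure. The function names, the 0.5 cutoff, and the toy scores are hypothetical, and plain average linkage stands in here for whatever clustering algorithm ends up being used.

```python
from collections import defaultdict


def distance_matrix(hits, names):
    """Turn (query, subject, bitscore) hits into a symmetric distance dict.

    Distance = 1 - (summed bit score between a and b) / (smaller self-score),
    clipped to the range [0, 1]. Pairs without any hit get distance 1.0.
    """
    total = defaultdict(float)
    for query, subject, bits in hits:
        total[(query, subject)] += bits

    dist = {}
    for a in names:
        for b in names:
            if a == b:
                dist[(a, b)] = 0.0
                continue
            # Symmetrize by taking the better of the two search directions.
            score = max(total[(a, b)], total[(b, a)])
            self_score = min(total[(a, a)], total[(b, b)])
            ratio = score / self_score if self_score > 0 else 0.0
            dist[(a, b)] = 1.0 - min(ratio, 1.0)
    return dist


def average_linkage(dist, names, cutoff):
    """Naive agglomerative clustering; stops when the closest pair exceeds cutoff."""
    groups = [[n] for n in names]

    def avg(g1, g2):
        return sum(dist[(a, b)] for a in g1 for b in g2) / (len(g1) * len(g2))

    while len(groups) > 1:
        d, i, j = min(
            (avg(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        if d > cutoff:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Toy, made-up bit scores for three clusters.
hits = [
    ("A", "A", 1000), ("B", "B", 1000), ("C", "C", 1000),
    ("A", "B", 800), ("B", "A", 780), ("A", "C", 100),
]
names = ["A", "B", "C"]
dist = distance_matrix(hits, names)
families = average_linkage(dist, names, cutoff=0.5)
print(families)
```

This naive loop recomputes every group-to-group average at each merge, so for ~1000 clusters a library implementation of hierarchical clustering would be the practical choice.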

Has anyone attempted a similar task and found a more robust way of grouping into GCFs? These methods require quite a bit of manual fiddling and fail to take into account important features such as synteny. Thanks!

gene cluster grouping clustering families • 1.1k views

CD-HIT is made for clustering sequences; give that a try.

I think your dataset is going to be too large for a standard multiple sequence alignment followed by hierarchical clustering, but you might be able to cluster on something like Mash/MinHash distances between all your sequences instead.
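
To show the idea behind MinHash distances, here is a small pure-Python sketch. It is not Mash itself: it hashes k-mers with `hashlib.blake2b`, ignores reverse complements, and uses names and defaults (`sketch`, `jaccard_estimate`, `mash_distance`, k=21, sketch size 1000) that I picked for illustration. The distance formula, -ln(2j / (1 + j)) / k, is the one Mash uses to convert a Jaccard estimate j into a distance.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (forward strand only)."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest k-mer hashes of the sequence."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sk_a, sk_b, size=1000):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate j into a Mash-style distance in [0, 1]."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


# Toy sequences: two identical, one sharing no k-mers with the others.
seq_a = "ACGTTGCAAGCTAGGCTTAACG" * 3
seq_b = seq_a
seq_c = "T" * 60

j_same = jaccard_estimate(sketch(seq_a, k=5), sketch(seq_b, k=5))
j_diff = jaccard_estimate(sketch(seq_a, k=5), sketch(seq_c, k=5))
print(mash_distance(j_same, k=5), mash_distance(j_diff, k=5))
```

The resulting all-vs-all distances could then be fed into whatever clustering step you prefer.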

7.2 years ago by Joe • 22k