Grouping secondary metabolite gene clusters into families
3.5 years ago
kayrouz.1 • 0

I have a set of ~1000 bacterial secondary metabolite gene clusters that all encode biosynthesis of the same class of compound, as judged by a couple of marker genes common to all clusters. However, they also contain variable regions that encode a wide variety of modifying enzymes. I have nucleotide FASTA and GenBank files for each cluster, and each cluster is about 40 kb long. Does anybody know of a good way to group these gene clusters into gene cluster families (GCFs)?

So far I've been able to get decent results by:

  1. Grouping based on similarity of the marker gene sequences and using MAUVE to visualize conservation of gene context
  2. Clustering the context genes into orthologous groups and finding which organisms have orthologs in common
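For approach 2, one way to turn "orthologs in common" into a number is a Jaccard distance on the sets of orthologous groups (OGs) found in each cluster. This is only a minimal sketch: the cluster names and OG identifiers are hypothetical placeholders, and it assumes the OG assignment has already been done by whatever orthology tool you use.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two collections of OG identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty clusters are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(og_content):
    """og_content maps a cluster name to the OG identifiers it contains.

    Returns a dict keyed by (name_x, name_y) with name_x < name_y.
    """
    return {
        (x, y): jaccard_distance(og_content[x], og_content[y])
        for x, y in combinations(sorted(og_content), 2)
    }


# Hypothetical example data, not taken from any real dataset.
example = {
    "cluster_A": {"OG1", "OG2", "OG3"},
    "cluster_B": {"OG1", "OG2", "OG4"},
    "cluster_C": {"OG7", "OG8"},
}
distances = pairwise_distances(example)
```

Note that this measure ignores gene order, so it does not capture synteny; it only reflects shared gene content.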

My next idea is to compute a pairwise tblastx distance matrix among all 1000 gene clusters and group based on distance scores using a clustering algorithm (e.g. CLANS).
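To illustrate the grouping step, here is a minimal sketch that takes precomputed pairwise distances and assigns clusters to families by single linkage with a cutoff: any two clusters closer than the cutoff end up in the same family. This is not what CLANS does (CLANS uses a force-directed layout); single linkage is swapped in here because it is easy to show. The names, distances, and cutoff are hypothetical, and the distances are assumed to be already scaled to the range 0 to 1.

```python
def single_linkage_families(names, distances, cutoff):
    """Group items whose pairwise distance is <= cutoff (single linkage).

    names: list of cluster names.
    distances: dict mapping (name_a, name_b) to a distance.
    Returns a sorted list of sorted member lists.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), d in distances.items():
        if d <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for n in names:
        families.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in families.values())


# Hypothetical distances for four gene clusters.
names = ["c1", "c2", "c3", "c4"]
toy_distances = {
    ("c1", "c2"): 0.10, ("c1", "c3"): 0.90, ("c1", "c4"): 0.95,
    ("c2", "c3"): 0.85, ("c2", "c4"): 0.90, ("c3", "c4"): 0.20,
}
families = single_linkage_families(names, toy_distances, cutoff=0.3)
print(families)
```

Single linkage tends to chain loosely related clusters together, so the cutoff usually needs checking against a few families you already trust.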

Has anyone attempted a similar task and found a more robust way of grouping into GCFs? These methods require quite a bit of manual fiddling and fail to take into account important features such as synteny. Thanks!

gene cluster grouping clustering families • 652 views

CD-HIT is designed for clustering sequences; give that a try.

I think your dataset is too large for a standard multiple sequence alignment followed by hierarchical clustering, but you might be able to cluster on Mash/MinHash distances computed between all your sequences instead.
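To show roughly what a MinHash distance is doing, here is a small pure-Python sketch of a bottom-s sketch, the Jaccard estimate derived from it, and the Mash distance formula D = -ln(2j / (1 + j)) / k. It is a simplification and not the Mash implementation: it hashes with SHA-1 and ignores reverse complements, and the k-mer size, sketch size, and test sequences are assumptions chosen for illustration. For real data you would run the Mash tool itself.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq (one strand only)."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(seq[i:i + k].encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def bottom_sketch(seq, k=21, size=1000):
    """Keep the `size` smallest k-mer hashes as the sketch."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = set_a & set_b
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash distance for k-mer size k."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Toy sequences (assumed, far shorter than a real 40 kb cluster).
seq = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
sketch = bottom_sketch(seq, k=5, size=50)
j_same = jaccard_estimate(sketch, sketch, size=50)
j_diff = jaccard_estimate(
    bottom_sketch("A" * 50, k=5, size=50),
    bottom_sketch("C" * 50, k=5, size=50),
    size=50,
)
```

The resulting all-against-all distances can then be fed to whatever clustering method you prefer.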

