Question

Maintaining the coverage filter in mmseqs for cascaded clustering


12 months ago

Victor • 0

Hi everyone,

Hopefully there are some experienced mmseqs users here who can help me with an issue regarding cascaded clustering.

I am a fairly new user of mmseqs and have run into unexpected behavior that I am unable to resolve. I am attempting to cluster a database of eukaryotic protein sequences (~1.8*10e6 sequences) using profile-search-based clustering, by iterating (cascading) the workflow described in "How to cluster using profiles". The issue is that sequences of very different lengths are merged during clustering despite providing -c 0.8 --cov-mode 0 during the searches. This causes problems during cascaded clustering, as single-domain proteins are merged with multi-domain proteins.

Example output after one round of the protocol (described below, -c 0.8, --cov-mode 0)

For basic cascaded clustering with "mmseqs cluster", or single-round clustering using "search" followed by "clust", the coverage filter appears to function as intended. Perhaps something in the profile generation, or in the implementation of profile-against-consensus searches, affects the interpretation of the -c parameter? Inspecting the alignment data for the attached MSA with mmseqs convertalis (attached below) shows that all hits do indeed pass the -c 0.8 cutoff. So perhaps my understanding of what constitutes alignment coverage is lacking; in that case, how would one go about restricting the "coverage" to only query-target pairs with lengths within 80% of each other? I have tried --cov-mode 5 with similar results.
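To make the coverage question concrete, here is a minimal sketch of how query coverage, target coverage, and the length ratio could be recomputed from a convertalis table. It assumes a tab-separated table with the columns query, target, qstart, qend, qlen, tstart, tend, tlen (for example, requested through the --format-output option of convertalis), and it assumes coverage is the aligned span divided by the sequence length, which may differ in detail from what mmseqs computes internally. The function names add_coverage and filter_length_ratio are made up for this illustration.

```shell
# Assumed input on stdin: tab-separated lines with the columns
#   query  target  qstart  qend  qlen  tstart  tend  tlen
# This column layout is an assumption, not the convertalis default.

# Prints query, target, qcov, tcov and the shorter/longer length ratio.
add_coverage() {
    awk -F'\t' '{
        qcov  = ($4 - $3 + 1) / $5               # aligned query span / query length
        tcov  = ($7 - $6 + 1) / $8               # aligned target span / target length
        ratio = ($5 < $8) ? $5 / $8 : $8 / $5    # shorter length / longer length
        printf "%s\t%s\t%.2f\t%.2f\t%.2f\n", $1, $2, qcov, tcov, ratio
    }'
}

# Keeps only pairs whose shorter sequence is at least min_ratio ($1)
# of the longer sequence; lines are passed through unchanged.
filter_length_ratio() {
    awk -F'\t' -v min="$1" '{
        ratio = ($5 < $8) ? $5 / $8 : $8 / $5
        if (ratio >= min) print
    }'
}
```

For example, `filter_length_ratio 0.8 < hits.tsv` would keep only pairs where the shorter sequence is at least 80% as long as the longer one (hits.tsv is a placeholder name).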

My protocol can be summarized roughly as pseudocode:

Collapse paralogs and create cluster representatives to reduce database redundancy:

mmseqs cluster initial-database clusters tmp -s 5 -c 0.8 --min-seq-id 0.9 --cluster-mode 0 --max-iterations 3 --max-seqs 100 --cov-mode 0

Iterate profile generation and searches of profiles against consensus sequences:

mmseqs search cluster-representatives cluster-representatives representative-search tmp -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003

mmseqs result2profile cluster-representatives cluster-representatives representative-search profiles

mmseqs profile2consensus profiles initial-database consensus

mmseqs search profiles consensus profile-search tmp -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003

mmseqs clust --cluster-mode 0 consensus profile-search profile-clusters

mmseqs createsubdb profile-clusters initial-database new-cluster-representatives

new-cluster-representatives is then used as the input to the second round of searches.
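To check whether a round has merged sequences of very different lengths, the length spread within each cluster could be summarized with something like the sketch below. It assumes two tab-separated files that are not part of the protocol above: a table of sequence lengths (identifier, length) and a two-column cluster table (representative, member), such as the one mmseqs createtsv writes for a clustering result. The function name cluster_length_spread is hypothetical.

```shell
# Assumed inputs (both file layouts are assumptions for this sketch):
#   $1: lengths table,  "sequence_id <TAB> length"
#   $2: cluster table,  "representative <TAB> member"
# Prints one line per cluster:
#   representative  n_members  min_len  max_len  min_len/max_len
cluster_length_spread() {
    awk -F'\t' '
        NR == FNR { len[$1] = $2 + 0; next }     # first file: remember lengths as numbers
        !($2 in len) { next }                    # skip members with no known length
        {
            n[$1]++
            if (!($1 in min) || len[$2] < min[$1]) min[$1] = len[$2]
            if (!($1 in max) || len[$2] > max[$1]) max[$1] = len[$2]
        }
        END {
            # Output order is unspecified; pipe through sort if needed.
            for (rep in n)
                printf "%s\t%d\t%d\t%d\t%.2f\n", rep, n[rep], min[rep], max[rep], min[rep] / max[rep]
        }
    ' "$1" "$2"
}
```

Clusters with a min/max ratio far below 0.8 in the last column would be the ones to inspect.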

Thank you for any possible help in this matter!

coverage searches mmseqs clustering sequence • 952 views


score 1 · Answer 1 · 2023-04-25

Not an expert on MMseqs2, but I have some suggestions for you. Below I leave the explanations for a couple of switches that I think are relevant.

First, I would run all of these at high sensitivity, and always use the same value; I recommend -s 7.5 for all your commands. Next, I am not sure about the reasoning for -e 0.003. It shouldn't matter much, but I'd leave it at the default value, -e 0.001. I think the maximum number of sequences doesn't matter much once it is in the hundreds, but for maximum accuracy without cutting any corners I would put a large number there to make sure that no sequences are filtered out, something like --max-seqs 5000. Finally, I would experiment with the cluster mode and try --cluster-mode 2. I am not sure what the difference is between 2 and 3 for that switch, but it could be worth exploring.

Beware that most of these changes will considerably slow down the clustering. The rationale is that we either want it fast or we want it good - tough to get both at the same time.

 -s FLOAT                        Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]
 --max-seqs INT                  Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [20]
 -e DOUBLE                       List matches below this E-value (range 0.0-inf) [1.000E-03]
 --cluster-mode INT              0: Set-Cover (greedy)
                                 1: Connected component (BLASTclust)
                                 2,3: Greedy clustering by sequence length (CDHIT) [0]