Maintaining the coverage filter in mmseqs for cascaded clustering
12 months ago
Victor • 0

Hi everyone,

Hopefully there are some experienced mmseqs users here who can help me with an issue regarding cascaded clustering.

I am a fairly new user of mmseqs and have run into some unexpected behavior that I am unable to resolve. I am attempting to cluster a database of eukaryotic protein sequences (~1.8*10e6 sequences) using profile-search-based clustering, by iterating (cascading) the workflow described in "How to cluster using profiles". The issue is that sequences of very different lengths are merged during clustering despite providing -c 0.8 --cov-mode 0 during the searches. This causes problems during cascaded clustering, as single-domain proteins are merged with multi-domain proteins.

[Image: example output after one round of the protocol described below (-c 0.8, --cov-mode 0)]

For basic cascaded clustering with "mmseqs cluster", or single-round clustering using "search" followed by "clust", the behavior appears to be as intended. Perhaps something in the profile generation, or in the implementation of profile-against-consensus searches, affects the interpretation of the -c parameter? Inspecting the alignment data for the attached MSA with mmseqs convertalis (attached below) shows that all hits do indeed pass the -c 0.8 cutoff. So perhaps my understanding of what constitutes alignment coverage is lacking; in that case, how would one go about restricting the "coverage" to only query-target pairs with lengths within 80% of each other? I have tried --cov-mode 5 with similar results.
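On the last question, a length restriction can also be applied after the search, outside of mmseqs. The following is a minimal sketch, assuming the alignments were exported with mmseqs convertalis using --format-output "query,target,qlen,tlen" so that the two lengths sit in columns 3 and 4. The helper name filter_by_length_ratio is hypothetical and not part of mmseqs. Note that in a profile-against-consensus search the reported lengths are presumably those of the profile and the consensus sequence, which may differ from the lengths of the underlying member sequences.

```shell
# Hypothetical helper, not an mmseqs command.
# Reads a tab-separated alignment table on stdin with the assumed column order
#   query, target, qlen, tlen
# (e.g. from: mmseqs convertalis ... --format-output "query,target,qlen,tlen")
# and prints only rows where shorter_length / longer_length >= the given ratio.
filter_by_length_ratio() {
  awk -F'\t' -v r="$1" '
    {
      q = $3 + 0                      # query length
      t = $4 + 0                      # target length
      lo = (q < t) ? q : t            # shorter of the two
      hi = (q < t) ? t : q            # longer of the two
      if (hi > 0 && lo / hi >= r) print
    }'
}
```

For example, `filter_by_length_ratio 0.8 < aln.tsv > aln.len80.tsv` would keep only pairs whose lengths are within 80% of each other; aln.tsv is a placeholder file name. This compares lengths directly, whereas -c with --cov-mode 0 thresholds the aligned fraction of query and target.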

My protocol can be summarized roughly as pseudocode:

  1. Collapse paralogs and create cluster representatives in order to reduce database redundancy using:

mmseqs cluster initial-database clusters -s 5 -c 0.8 --min-seq-id 0.9 --cluster-mode 0 --max-iterations 3 --max-seqs 100 --cov-mode 0

  2. Iterate profile generation and searches of profiles against consensus sequences:

mmseqs search cluster-representatives cluster-representatives representative-search -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003

mmseqs result2profile cluster-representatives cluster-representatives representative-search profiles

mmseqs profile2consensus profiles initial-database consensus

mmseqs search profiles consensus profile-search -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003

mmseqs clust --cluster-mode 0 consensus profile-search profile-clusters

mmseqs createsubdb profile-clusters initial-database new-cluster-representatives

Here new-cluster-representatives are used as input to round two of searches.

Thank you for any possible help in this matter!

coverage searches mmseqs clustering sequence • 952 views
12 months ago
Mensur Dlakic ★ 27k

Not an expert on MMseqs2, but I have some suggestions for you. Below I leave the explanations for a couple of switches that I think are relevant.

First, I would run all of these at high sensitivity, and always use the same value; I recommend -s 7.5 for all your commands. Next, I am not sure about the reasoning for -e 0.003. It shouldn't matter a whole lot, but I'd leave it at the default value, -e 0.001. I think the maximum number of sequences doesn't matter much once it is in the hundreds, but for maximum accuracy without cutting any corners I would put a large number there to make sure that no sequences are filtered out, something like --max-seqs 5000. Finally, I would play with the cluster mode and try --cluster-mode 2. I am not sure what the difference is between 2 and 3 for that switch, but it could be worth exploring.

Beware that most of these changes will considerably slow down the clustering. The rationale is that we either want it fast or we want it good - tough to get both at the same time.

 -s FLOAT                        Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]
 --max-seqs INT                  Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [20]
 -e DOUBLE                       List matches below this E-value (range 0.0-inf) [1.000E-03]
 --cluster-mode INT              0: Set-Cover (greedy)
                                 1: Connected component (BLASTclust)
                                 2,3: Greedy clustering by sequence length (CDHIT) [0]
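Put together, the suggestions above might look like the sketch below. The wrapper names sensitive_profile_search and greedy_clust are hypothetical, the database names passed in are placeholders, and the trailing temporary-directory argument to search is an assumption about the usual mmseqs search call form rather than something taken from the question. The -c 0.8 --cov-mode 0 values are carried over from the original protocol.

```shell
# Path to the mmseqs binary; set MMSEQS=echo to print the commands instead of running them.
MMSEQS="${MMSEQS:-mmseqs}"

# Hypothetical wrapper: search with the high-sensitivity settings suggested above.
#   $1 query (profile) DB, $2 target (consensus) DB, $3 result DB, $4 temporary directory
sensitive_profile_search() {
  "$MMSEQS" search "$1" "$2" "$3" "$4" \
    -s 7.5 -e 0.001 --max-seqs 5000 -c 0.8 --cov-mode 0
}

# Hypothetical wrapper: cluster a search result with greedy, length-ordered clustering.
#   $1 sequence DB, $2 result DB, $3 output cluster DB
greedy_clust() {
  "$MMSEQS" clust "$1" "$2" "$3" --cluster-mode 2
}
```

Running with `MMSEQS=echo` first is a cheap way to check the exact command lines before committing to a slow, high-sensitivity run.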

Hi Mensur,

Thank you for the quick answer.

It turns out that, as is more often than not the case, the issue here is an error in the implementation rather than in the software. There is a difference between the alignment information contained in the resulting cluster databases and that in their corresponding alignment databases. Generating an MSA from a cluster database displays only part of the alignments; I do not understand what determines which part is displayed, however. Because of this discrepancy, the MSAs look far more fragmented than they are in reality, as validated by taking subsets from the alignment data or the original sequence database and aligning them separately.
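A check that does not depend on how an MSA is rendered is to compare member lengths within each cluster directly. The sketch below assumes two tab-separated inputs: a table of sequence lengths (id, length), which is a hypothetical file that would have to be prepared separately, and a cluster table (representative, member) such as one exported from a cluster database with mmseqs createtsv. The helper name cluster_length_spread is likewise hypothetical.

```shell
# Hypothetical helper, not an mmseqs command.
#   $1: lengths table, tab-separated:  sequence id, length
#   $2: cluster table, tab-separated:  representative id, member id
# Prints, per cluster: representative, shortest member, longest member, and
# the shortest/longest length ratio (two decimals), sorted by representative.
cluster_length_spread() {
  awk -F'\t' '
    NR == FNR { len[$1] = $2; next }          # first file: remember each length
    ($2 in len) {                             # second file: one row per member
      l = len[$2] + 0
      if (!($1 in min) || l < min[$1]) min[$1] = l
      if (!($1 in max) || l > max[$1]) max[$1] = l
    }
    END {
      for (rep in min)
        if (max[rep] > 0)
          printf "%s\t%d\t%d\t%.2f\n", rep, min[rep], max[rep], min[rep] / max[rep]
    }
  ' "$1" "$2" | sort
}
```

Clusters with a ratio well below the intended threshold are the ones worth inspecting in the alignment database itself rather than in the rendered MSA.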

As such this is not a real issue, apologies for the confusion and thanks for your time.

I will delete this question so as not to confuse people.

Best, Victor
