Hi everyone,
Hopefully there are some experienced mmseqs users here who can help me with an issue regarding cascaded clustering.
I am a fairly new user of mmseqs and have run into some unexpected behavior that I am unable to resolve. I am attempting to cluster a database of eukaryotic protein sequences (~1.8*10e6 sequences) using profile-search-based clustering, by iterating (cascading) the workflow described in "How to cluster using profiles". The issue I am encountering is that sequences of very different lengths are merged during clustering despite providing -c 0.8 --cov-mode 0 during the searches. This causes problems in cascaded clustering, as single-domain proteins are merged with multi-domain proteins.
Example output after one round of the protocol (described below, -c 0.8, --cov-mode 0)
For basic cascaded clustering with "mmseqs cluster", or single-round clustering using "search" followed by "clust", the behavior appears to work as intended. Perhaps something in the profile generation, or in the implementation of profile-against-consensus searches, affects the interpretation of the -c parameter? Investigating the alignment data of the attached MSA with mmseqs convertalis (attached below) shows that all hits do indeed pass the -c 0.8 cutoff. Perhaps my understanding of what constitutes alignment coverage is lacking; in that case, how would one go about restricting the "coverage" to only those query-target pairs whose lengths are within 80% of each other? I have tried --cov-mode 5 with similar results.
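To make the distinction between alignment coverage and sequence length ratio concrete, here is a minimal Python sketch. It assumes the hits have been exported as tab-separated text with the columns query, target, qstart, qend, qlen, tstart, tend, tlen (for example via the --format-output option of convertalis). The function names and the 0.8 threshold are illustrative choices, not part of mmseqs.

```python
import csv
import io


def coverage_and_length_ratio(qstart, qend, qlen, tstart, tend, tlen):
    """Return (query coverage, target coverage, length ratio) for one hit.

    Coverage is the aligned span divided by the full sequence length;
    the length ratio is the shorter length divided by the longer one.
    """
    qcov = (qend - qstart + 1) / qlen
    tcov = (tend - tstart + 1) / tlen
    ratio = min(qlen, tlen) / max(qlen, tlen)
    return qcov, tcov, ratio


def filter_hits(tsv_text, min_ratio=0.8):
    """Keep only hits whose two sequence lengths are within min_ratio of each other.

    Assumed column order: query, target, qstart, qend, qlen, tstart, tend, tlen.
    """
    kept = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, target = row[0], row[1]
        qstart, qend, qlen, tstart, tend, tlen = map(int, row[2:8])
        qcov, tcov, ratio = coverage_and_length_ratio(
            qstart, qend, qlen, tstart, tend, tlen
        )
        if ratio >= min_ratio:
            kept.append((query, target, qcov, tcov, ratio))
    return kept
```

Reporting both coverages next to the length ratio for each hit makes it easy to see whether a suspicious pair fails on coverage, on length, or on neither.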
My protocol can be summarized roughly as pseudocode:
- Collapse paralogs and create cluster representatives in order to reduce database redundancy, using:
mmseqs cluster initial-database clusters -s 5 -c 0.8 --min-seq-id 0.9 --cluster-mode 0 --max-iterations 3 --max-seqs 100 --cov-mode 0
- Iterate profile generation and searches of profiles against consensus sequences:
mmseqs search cluster-representatives cluster-representatives representative-search -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003
mmseqs result2profile cluster-representatives cluster-representatives representative-search profiles
mmseqs profile2consensus profiles initial-database consensus
mmseqs search profiles consensus profile-search -s 7 -c 0.8 --cov-mode 0 --max-seqs 300 -e 0.003
mmseqs clust --cluster-mode 0 consensus profile-search profile-clusters
mmseqs createsubdb profile-clusters initial-database new-cluster-representatives
Here, new-cluster-representatives is used as the input to the second round of searches.
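The iteration described above can be sketched as a small Python helper that only builds the command lines for each round without running them. This is a rough sketch under stated assumptions: the per-round database names (the "roundN-" prefix), the helper names, and the tmp_dir argument (the temporary directory that mmseqs search expects) are hypothetical additions; the subcommands and their argument order simply mirror the commands listed above.

```python
import shlex


def round_commands(reps_db, round_idx, initial_db="initial-database", tmp_dir="tmp"):
    """Build the argv lists for one profile-clustering round.

    Returns (commands, name of the new representatives database).
    Database names prefixed with 'roundN-' are hypothetical.
    """
    tag = f"round{round_idx}"
    search_opts = ["-s", "7", "-c", "0.8", "--cov-mode", "0",
                   "--max-seqs", "300", "-e", "0.003"]
    rep_search = f"{tag}-representative-search"
    profiles = f"{tag}-profiles"
    consensus = f"{tag}-consensus"
    profile_search = f"{tag}-profile-search"
    clusters = f"{tag}-profile-clusters"
    new_reps = f"{tag}-new-cluster-representatives"
    commands = [
        ["mmseqs", "search", reps_db, reps_db, rep_search, tmp_dir, *search_opts],
        ["mmseqs", "result2profile", reps_db, reps_db, rep_search, profiles],
        ["mmseqs", "profile2consensus", profiles, initial_db, consensus],
        ["mmseqs", "search", profiles, consensus, profile_search, tmp_dir, *search_opts],
        ["mmseqs", "clust", "--cluster-mode", "0", consensus, profile_search, clusters],
        ["mmseqs", "createsubdb", clusters, initial_db, new_reps],
    ]
    return commands, new_reps


def cascade(first_reps, n_rounds):
    """Chain rounds so each round starts from the previous round's representatives."""
    plan = []
    reps = first_reps
    for i in range(1, n_rounds + 1):
        commands, reps = round_commands(reps, i)
        plan.extend(commands)
    return plan, reps


if __name__ == "__main__":
    plan, final_reps = cascade("cluster-representatives", 2)
    for argv in plan:
        print(shlex.join(argv))  # print only; nothing is executed here
```

Printing the plan before running anything makes it easier to confirm that each round's searches receive the previous round's representatives and the intended coverage options.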
Thank you for any possible help in this matter!
Hi Mensur,
Thank you for the quick answer.
As is so often the case, the issue turned out to be an error in my implementation rather than in the software. The alignment information contained in the resulting cluster databases differs from that in their corresponding alignment databases: generating an MSA from a cluster database displays only part of the alignments, although I do not understand what determines which part is displayed. Because of this discrepancy, the MSAs look far more fragmented than they really are, which I validated by taking subsets from the alignment data or the original sequence database and aligning them separately.
As such, this is not a real issue. Apologies for the confusion, and thanks for your time.
I will delete this question so as not to confuse people.
Best, Victor