MMseqs2: Removing Redundancy
11 weeks ago
dthorbur ★ 2.1k

I've been using MMseqs2 to remove redundancy from a large sequence database and to cluster it. I am working through the documentation, but it is very broad, so some of the finer details are missing, or I can only find good examples in issues on the GitHub page.

I am trying to use the clusthash function to remove identical sequences before I can cluster. Here is what I've created:

mmseqs createdb ${DATA_DIR}/input.fasta ${DB_NAME}
mmseqs clusthash ${DB_NAME} ${DB_NAME}_ch --min-seq-id 1.0
mmseqs clust ${DB_NAME} ${DB_NAME}_ch ${DB_NAME}_ch_clu
mmseqs result2flat ${DB_NAME} ${DB_NAME} ${DB_NAME}_ch_clu ${DATA_DIR}/${DB_NAME}_ch_reps.fasta
mmseqs createdb ${DATA_DIR}/${DB_NAME}_ch_reps.fasta ${DB_NAME}_clusthash

This works, but it feels circular and long-winded. Is there a way to turn the clusthash output more directly into an MMseqs2 database that I can pass to mmseqs cluster, rather than wasting CPU time on all these steps?

In particular, when I tried to run mmseqs cluster on some of these intermediate databases, I got the error Input database "hemipteratestdb_clhash" has the wrong type (Alignment), or something similar. When it did run, it overwrote the clusthash clusters and clustered much as it would have if I had gone straight from createdb to cluster.

mmseqs mmseqs2 • 240 views

This is likely best posted on the MMseqs2 GitHub. I don't think any of the authors participate here. Please post the solution here if you get one.
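One possible shortcut, offered as a sketch rather than a confirmed answer: the "wrong type (Alignment)" error suggests that the clusthash output is a result database, not a sequence database, which would explain why mmseqs cluster rejects it. The MMseqs2 user guide describes createsubdb for pulling cluster representatives out of a sequence database, which would avoid the result2flat and second createdb round trip. The helper names run and dedup_to_db, the _ch_rep suffix, and the DRY_RUN switch are made up for illustration, and behaviour may differ between MMseqs2 versions.

```shell
# Hypothetical helper: with DRY_RUN=1 it prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sketch: collapse identical sequences and keep the representatives as a
# sequence DB, assuming the DB named in $1 was already built with createdb.
dedup_to_db() {
  local db="$1"
  # Group identical sequences (output is a result DB, not a sequence DB)
  run mmseqs clusthash "$db" "${db}_ch" --min-seq-id 1.0 &&
  # Turn the clusthash result into a cluster DB
  run mmseqs clust "$db" "${db}_ch" "${db}_ch_clu" &&
  # Extract the representative of each cluster into a new sequence DB
  run mmseqs createsubdb "${db}_ch_clu" "$db" "${db}_ch_rep"
}

# Example call (not executed here):
#   dedup_to_db "${DB_NAME}"
#   mmseqs cluster "${DB_NAME}_ch_rep" "${DB_NAME}_clu" tmp
```

If this works as the user guide suggests, ${DB_NAME}_ch_rep should be a sequence database that mmseqs cluster accepts directly, and the original ${DB_NAME} and its clusthash clusters stay untouched because the later clustering writes to a different output name.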

