MMseqs2: Removing Redundancy
27 days ago
dthorbur ★ 1.9k

I've been using mmseqs to remove redundancy from a large sequence database and then cluster it. I am working through the documentation, but it is very broad, so some of the finer details are missing, or I can only find good examples in issues on the GitHub page.

I am trying to use the clusthash function to remove identical sequences before clustering. Here is what I have so far:

mmseqs createdb ${DATA_DIR}/input.fasta ${DB_NAME}
mmseqs clusthash ${DB_NAME} ${DB_NAME}_ch --min-seq-id 1.0
mmseqs clust ${DB_NAME} ${DB_NAME}_ch ${DB_NAME}_ch_clu
mmseqs result2flat ${DB_NAME} ${DB_NAME} ${DB_NAME}_ch_clu ${DATA_DIR}/${DB_NAME}_ch_reps.fasta
mmseqs createdb ${DATA_DIR}/${DB_NAME}_ch_reps.fasta ${DB_NAME}_clusthash

This works, but it feels very circular and long-winded. Is there a way to take the clusthash output more directly into an mmseqs database that I can use with the mmseqs cluster function, rather than wasting CPU time going through all these steps?

In particular, when I tried to use the mmseqs cluster function on some of these other databases, I got the error: Input database "hemipteratestdb_clhash" has the wrong type (Alignment), or something similar. When it has worked, it overwrites the clusthash clusters and produces much the same clustering as if I had gone straight from createdb to cluster.
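A possible shorter route, offered as an untested sketch rather than a confirmed answer: the wrong-type error suggests that mmseqs cluster expects a sequence database as input, whereas the clusthash output is a result (alignment-type) database. The MMseqs2 user guide describes extracting cluster representatives with createsubdb, which writes a sequence database directly and so would skip the result2flat and second createdb steps. The database suffixes (_ch_rep, _clu), the tmp directory, the default name mydb, and the DRY_RUN switch below are illustrative assumptions, and createsubdb behaviour (for example, whether the header database is carried over) may differ between MMseqs2 versions.

```shell
#!/bin/sh
# Sketch only: deduplicate with clusthash, then build a representative-only
# sequence database without a FASTA round trip.
# Assumed placeholders: DB_NAME default, *_ch_rep / *_clu suffixes, tmp, DRY_RUN.
set -eu

DB_NAME="${DB_NAME:-mydb}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

dedup_then_cluster() {
  db="$1"
  # 1. Group identical sequences (produces a result DB, not a sequence DB).
  run mmseqs clusthash "$db" "${db}_ch" --min-seq-id 1.0
  # 2. Turn the clusthash result into a cluster DB.
  run mmseqs clust "$db" "${db}_ch" "${db}_ch_clu"
  # 3. Pull the representatives straight into a new sequence DB.
  run mmseqs createsubdb "${db}_ch_clu" "$db" "${db}_ch_rep"
  # 4. Cluster the deduplicated sequence DB (thresholds left at defaults).
  run mmseqs cluster "${db}_ch_rep" "${db}_clu" tmp
}

dedup_then_cluster "$DB_NAME"
```

With DRY_RUN=1 (the default in this sketch) the script only prints the four commands so they can be checked first; setting DRY_RUN=0 would execute them against an existing database created with createdb.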

This is probably best posted on the MMseqs2 GitHub. I don't think any of the authors participate here. Please post the solution here if you get one.
