I've been using mmseqs to remove redundancy from a large sequence database and to cluster it. I am going through the documentation, but it is very broad, so some of the finer details are missing, or I can only find good examples in issues on the GitHub page.
I am trying to use the clusthash function to remove identical sequences before clustering. Here is what I've created:
mmseqs createdb ${DATA_DIR}/input.fasta ${DB_NAME}
mmseqs clusthash ${DB_NAME} ${DB_NAME}_ch --min-seq-id 1.0
mmseqs clust ${DB_NAME} ${DB_NAME}_ch ${DB_NAME}_ch_clu
mmseqs result2flat ${DB_NAME} ${DB_NAME} ${DB_NAME}_ch_clu ${DATA_DIR}/${DB_NAME}_ch_reps.fasta
mmseqs createdb ${DATA_DIR}/${DB_NAME}_ch_reps.fasta ${DB_NAME}_clusthash
This works, but it feels very circular and long-winded. Is there a way of taking the clusthash output more directly into an mmseqs database that I can use with the mmseqs cluster function, rather than wasting CPU time going through all these steps?
In particular, when I tried to use the mmseqs cluster function on some of these other databases, I got the error Input database "hemipteratestdb_clhash" has the wrong type (Alignment), or something similar. When it has worked, it overwrites the clusthash clusters and clusters in much the same way as if I had gone straight from createdb to cluster.
This is probably best posted on the mmseqs GitHub page. I don't think any of the authors participate here. Post the solution here if you get one.
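For anyone landing here later, one shorter route may be to skip the FASTA round trip and use mmseqs createsubdb to pull the representatives out of the clust result as a sequence database, which mmseqs cluster should then accept. The "wrong type (Alignment)" error is consistent with the clusthash output being a result database rather than a sequence database. The following is only a sketch under those assumptions: the function name dedup_then_cluster, the _rep and _rep_clu suffixes, the tmp directory and the DRY_RUN switch are all made up for illustration, and I have not checked this against every mmseqs version.

```shell
#!/usr/bin/env bash
# Sketch only: database names, suffixes and the tmp directory are placeholders.

# Print each command; run it for real only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: dedup_then_cluster <input.fasta> <db_name> <tmp_dir>
dedup_then_cluster() {
    local fasta=$1 db=$2 tmp=$3

    # Same first three steps as in the question.
    run mmseqs createdb "$fasta" "$db" || return 1
    run mmseqs clusthash "$db" "${db}_ch" --min-seq-id 1.0 || return 1
    run mmseqs clust "$db" "${db}_ch" "${db}_ch_clu" || return 1

    # Instead of result2flat + createdb: subset the original sequence DB
    # to the cluster representatives, giving a sequence-type database.
    run mmseqs createsubdb "${db}_ch_clu" "$db" "${db}_rep" || return 1

    # Cluster the deduplicated sequence DB (not the clusthash result DB).
    run mmseqs cluster "${db}_rep" "${db}_rep_clu" "$tmp" || return 1
}
```

Setting DRY_RUN=1 before calling dedup_then_cluster input.fasta mydb tmp only prints the commands, which is a cheap way to check the database names before spending any CPU time.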