Question

MMSeqs2 Removing Redundancy

4 months ago

dthorbur ★ 2.4k

I've been using mmseqs to remove redundancy from a large sequence database and cluster it. I am going through the documentation, but it's very broad, so some of the finer-scale details are missing, or I can only find good examples in issues on the GitHub page.

I am trying to use the clusthash function to remove identical sequences before I can cluster. Here is what I've created:

mmseqs createdb ${DATA_DIR}/input.fasta ${DB_NAME}
mmseqs clusthash ${DB_NAME} ${DB_NAME}_ch --min-seq-id 1.0
mmseqs clust ${DB_NAME} ${DB_NAME}_ch ${DB_NAME}_ch_clu
mmseqs result2flat ${DB_NAME} ${DB_NAME} ${DB_NAME}_ch_clu ${DATA_DIR}/${DB_NAME}_ch_reps.fasta
mmseqs createdb ${DATA_DIR}/${DB_NAME}_ch_reps.fasta ${DB_NAME}_clusthash

This works, but it feels very circular and long-winded. Is there a way to take the clusthash output more directly into an mmseqs database that I can use with mmseqs cluster, rather than wasting CPU time going through all these steps?

In particular, when I tried to run mmseqs cluster on some of these other databases, I got the error Input database "hemipteratestdb_clhash" has the wrong type (Alignment), or something similar. When it did run, it overwrote the clusthash clusters and clustered much as if I had gone straight from createdb to cluster.
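
A possible shortcut, sketched here as an assumption rather than a confirmed answer: clusthash appears to write an alignment-type result database, not a sequence database, which would explain the wrong type (Alignment) error when its output is passed to mmseqs cluster. The MMseqs2 user guide describes createsubdb for pulling cluster representatives into a new sequence database, which might replace the result2flat and second createdb steps. The run and dedupe_db helpers and the _rep suffix below are hypothetical names, and setting DRY_RUN=1 only prints the commands without running them.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: dedupe_db <sequenceDB>
# Assumes <sequenceDB> was already built with mmseqs createdb.
dedupe_db() {
    local db="$1"
    # Group identical sequences (writes a result DB, not a sequence DB).
    run mmseqs clusthash "$db" "${db}_ch" --min-seq-id 1.0
    # Turn the clusthash result into a cluster DB.
    run mmseqs clust "$db" "${db}_ch" "${db}_ch_clu"
    # Assumption: createsubdb keeps only the representative sequences,
    # giving a sequence DB without a FASTA round trip.
    run mmseqs createsubdb "${db}_ch_clu" "$db" "${db}_rep"
}
```

Calling dedupe_db "${DB_NAME}" and then mmseqs cluster "${DB_NAME}_rep" "${DB_NAME}_clu" tmp would, under these assumptions, cluster only the deduplicated representatives. Whether createsubdb also carries over the header database may depend on the MMseqs2 version, so that is worth checking before relying on it.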

mmseqs mmseqs2 • 343 views

updated 4 months ago by GenoMax 145k • written 4 months ago by dthorbur ★ 2.4k

This is likely best posted on the MMseqs2 GitHub. I don't think any of the authors participate here. Please post the solution here if you get one.

4 months ago by GenoMax 145k