How could I "recreate" UniRef50/UniRef90 with MMSEQS2?
14 months ago
O.rka ▴ 710

UniRef50/UniRef90 are really useful clustered databases. I'm interested in trying a similar approach to this nested clustering but with my own protein database.

Are there specific commands that were used for UniRef clustering with MMSEQS2?

I couldn't find these documented anywhere.

14 months ago
Mensur Dlakic ★ 27k

All the steps are explained in the MMseqs2 clustering guide.

In short: build an MMseqs2 sequence database from the FASTA file, cluster it, and convert the cluster representatives back to FASTA. These commands go from the full (100%) set to 90% identity; adjust --threads to your system:

mmseqs createdb db.fas DB100
mmseqs cluster DB100 clu90 tmp --min-seq-id 0.90 --threads 8 -s 6 --realign 1 --remove-tmp-files
mmseqs result2repseq DB100 clu90 clu90_seq
mmseqs result2flat DB100 DB100 clu90_seq db.90 --use-fasta-header

For each subsequent step, the trick is to start from the previous level's output (db.90 in the example above) rather than from the full-size database, as that cuts down on clustering time.

mmseqs createdb db.90 DB90
mmseqs cluster DB90 clu50 tmp --min-seq-id 0.50 --threads 8 -s 6 --realign 1 --remove-tmp-files
mmseqs result2repseq DB90 clu50 clu50_seq
mmseqs result2flat DB90 DB90 clu50_seq db.50 --use-fasta-header

Finally:

rm -rf tmp clu* DB*
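Both rounds above use the same four-command pattern, so they can be wrapped in a small helper. This is only a sketch: `cluster_level`, `run`, `DRY_RUN`, `THREADS` and the `DB_in_*` database names are made up for illustration, while the MMseqs2 flags are copied from the commands above. By default it only prints the commands, so you can inspect them before running anything.

```shell
# Sketch: print (or run) createdb -> cluster -> result2repseq -> result2flat
# for one identity level. Helper and variable names are illustrative only.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
THREADS="${THREADS:-8}"   # adjust to your system

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cluster_level INPUT_FASTA MIN_SEQ_ID TAG
# Writes the representatives to db.TAG (for example db.90).
cluster_level() {
    local in_fasta="$1" min_id="$2" tag="$3"
    local db="DB_in_${tag}" clu="clu${tag}" out_fasta="db.${tag}"

    run mmseqs createdb "$in_fasta" "$db"
    run mmseqs cluster "$db" "$clu" tmp --min-seq-id "$min_id" \
        --threads "$THREADS" -s 6 --realign 1 --remove-tmp-files
    run mmseqs result2repseq "$db" "$clu" "${clu}_seq"
    run mmseqs result2flat "$db" "$db" "${clu}_seq" "$out_fasta" --use-fasta-header
}

# Cascade: each level starts from the previous level's representatives.
cluster_level db.fas 0.90 90
cluster_level db.90  0.50 50
```

Setting `DRY_RUN=0` executes the commands instead of printing them. Extra levels (say 0.70) are just additional `cluster_level` calls that take the previous output as input.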

You will need a lot of disk space. You may want to try this on a smaller database first to get a feel for the time and disk space required. Swiss-Prot has a bit over half a million proteins, and I often use it for testing.
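For a quick trial run, one option is to cut a small test set out of your own FASTA file and count records before and after each clustering level. A minimal sketch, assuming plain FASTA input where every record starts with a ">" line; the function names and `test.fas` are placeholders, not anything MMseqs2 provides.

```shell
# first_n_records FASTA N: print the first N FASTA records of FASTA.
first_n_records() {
    awk -v n="$2" '/^>/ { if (++c > n) exit } { print }' "$1"
}

# count_seqs FASTA: number of records (lines starting with ">").
count_seqs() {
    grep -c '^>' "$1" || true
}

# Example (file names are placeholders): make a 10,000-sequence test set.
if [ -f db.fas ]; then
    first_n_records db.fas 10000 > test.fas
    echo "test.fas: $(count_seqs test.fas) sequences"
fi
```

Running `count_seqs` on db.fas, db.90 and db.50 shows how much each level shrinks the set. Prefixing the cluster command with `time` and checking `du -sh tmp` while it runs gives rough numbers to extrapolate from.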


This is awesome, thank you! I came across the UniRef help page, https://www.uniprot.org/help/uniref, which gives a general description of how the clusters are built. I've translated that description into commands using easy-cluster and easy-linclust. Does this look consistent with your more modular steps above?

mkdir -p mmseqs_100/
mmseqs easy-cluster proteins.fasta mmseqs_100/mmseqs2 tmp --min-seq-id 1.0 -c 1.0 --cov-mode 1 --dbtype 1
seqkit seq -m 11 mmseqs_100/mmseqs2_rep_seq.fasta > mmseqs_100/mmseqs2_rep_seq.gt11.fasta
rm -rf tmp/*

mkdir -p mmseqs_90/
mmseqs easy-linclust mmseqs_100/mmseqs2_rep_seq.gt11.fasta mmseqs_90/mmseqs2 tmp --min-seq-id 0.9 -c 0.8 --cov-mode 1 --dbtype 1
rm -rf tmp/*

mkdir -p mmseqs_50/
mmseqs easy-linclust mmseqs_90/mmseqs2_rep_seq.fasta mmseqs_50/mmseqs2 tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1 --dbtype 1
rm -rf tmp/*
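To compare the levels, the cluster tables can be summarised directly. This sketch assumes each easy-cluster/easy-linclust run wrote a `<prefix>_cluster.tsv` file next to the `_rep_seq.fasta` used above, with two tab-separated columns (representative, member); `summarize_clusters` is a made-up helper name.

```shell
# summarize_clusters TSV: number of clusters, number of members, largest cluster.
# Assumes column 1 = cluster representative, column 2 = member.
summarize_clusters() {
    awk -F '\t' '
        { size[$1]++; members++ }
        END {
            for (rep in size) {
                clusters++
                if (size[rep] > max) max = size[rep]
            }
            printf "clusters=%d members=%d largest=%d\n", clusters, members, max
        }' "$1"
}

# Report every level that has been produced so far.
for level in 100 90 50; do
    tsv="mmseqs_${level}/mmseqs2_cluster.tsv"
    if [ -f "$tsv" ]; then
        printf '%s\t' "$level"
        summarize_clusters "$tsv"
    fi
done
```

The cluster count at each level should match the number of sequences in the corresponding `_rep_seq.fasta`, which makes this a cheap sanity check before feeding one level into the next.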

I think this is a faster approach, which one may need for 100+ million sequences. I don't know how it compares to the solution I outlined above, but one needs to balance accuracy against resources. I have clustered about 3.5 million sequences as described above, and I think the first clustering step (to 90%) took about a day. Subsequent steps are faster if you start from an already clustered database.


You seem to be mixing and matching easy-cluster and easy-linclust. Note that these are not the same algorithm. I'm also not sure which parameters UniRef used, but the coverage mode and clustering mode may not match what you've used here either.
