Clustering sequences in a pandas dataframe
3 months ago
ba2so • 0

Hello,

I have a pandas dataframe containing 15 million rows and three columns; each row corresponds to a protein (its KEGG_ID, its KO, and its sequence).

I want to cluster my proteins so that sequences in different clusters share less than 40% sequence similarity. I first thought of using the MMseqs2 clustering algorithm, but I realised it only takes FASTA files as input and does not work directly on dataframes.

  • Do you have any other ideas for how I could do this clustering with my tabular data?
  • Am I forced to convert all my rows to fasta files to perform a clustering (15M proteins would be expensive in time and in memory)?

Thanks in advance for your help and advice.

pandas python clustering • 337 views
3 months ago
dthorbur ★ 1.9k

MMseqs2 would still be much simpler to use than any Python implementation you could write yourself, and it would be hard to match in terms of sensitivity, speed, and parameterization.

A FASTA file is simply a text file in which each record is a header line starting with ">" and the sequence name, followed by the sequence itself. You could write a simple script to write a FASTA file from your dataframe (though I'm sure there are packages too), which you can then feed into MMseqs2.
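As a rough illustration of such a script, here is a minimal sketch in Python. The function name `write_fasta` and the line width of 60 are my own choices, not anything MMseqs2 requires. It streams (identifier, sequence) pairs to disk one record at a time, so nothing beyond the dataframe itself has to be held in memory.

```python
from typing import Iterable, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], path: str, width: int = 60) -> int:
    """Write (identifier, sequence) pairs to a FASTA file and return the record count.

    `width` is the number of residues per line; 60 is a common convention,
    chosen here as an assumption.
    """
    count = 0
    with open(path, "w") as handle:
        for seq_id, seq in records:
            # Header line: ">" followed by the sequence name
            handle.write(f">{seq_id}\n")
            # Sequence, wrapped at `width` characters per line
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
            count += 1
    return count
```

Assuming the columns are literally named KEGG_ID and sequence (adjust to your actual column names), the call would look like `write_fasta(zip(df["KEGG_ID"], df["sequence"]), "proteins.fasta")`. Using the KEGG_ID as the header means the cluster assignments can be joined back to the dataframe afterwards.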

EDIT: If writing 15M records is too slow, you can chunk your dataframe and run the Python script on the chunks in parallel, or use packages specifically designed for speed. I'm only familiar with R, but the data.table package is exceptionally fast at reading and writing data, and you could adapt it for your purposes.
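The chunking idea could look something like the sketch below. The helper names (`chunked`, `write_fasta_parts`), the part-file naming scheme, and the default chunk size are all hypothetical. The sketch writes the parts one after another; since each part is independent, the writes could instead be handed to separate worker processes. The parts should be combined again (for example by concatenating them) before clustering, because clustering each chunk on its own would not guarantee the 40% threshold between clusters from different chunks.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def chunked(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_fasta_parts(records: Iterable[Record], prefix: str,
                      chunk_size: int = 1_000_000) -> List[str]:
    """Write records to numbered FASTA part files and return their paths."""
    paths = []
    for index, chunk in enumerate(chunked(records, chunk_size)):
        path = f"{prefix}.part{index:04d}.fasta"
        with open(path, "w") as handle:
            # One header line and one unwrapped sequence line per record
            handle.writelines(f">{seq_id}\n{seq}\n" for seq_id, seq in chunk)
        paths.append(path)
    return paths
```

With the same assumed column names as before, `write_fasta_parts(zip(df["KEGG_ID"], df["sequence"]), "proteins")` would produce `proteins.part0000.fasta`, `proteins.part0001.fasta`, and so on.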
