Question

Clustering sequences in a pandas dataframe


3 months ago

ba2so • 0

Hello,

I have a pandas dataframe containing 15 million rows and three columns, each row corresponding to a protein (its KEGG_ID, its KO and its sequence).

I want to cluster my proteins such that they have less than 40% sequence similarity between clusters. I first thought of using the MMseqs2 clustering algorithm, but I realised it only works on FASTA files and is not implemented for dataframes.

Would you have any other idea on how I could make this clustering with my tabular data?

Am I forced to convert all my rows to a FASTA file to perform the clustering (with 15M proteins this would be expensive in time and memory)?

Thanks in advance for your help and advice.

pandas python clustering • 337 views

updated 3 months ago by dthorbur ★ 1.9k • written 3 months ago by ba2so • 0

score 4 · Accepted Answer · 2024-01-24

MMseqs2 would still be a much simpler and better option than any Python implementation you could write, in terms of sensitivity, speed, and parameterization.

A FASTA file is simply a text file in which each record is a header line starting with ">" that gives the sequence name, followed by the sequence itself. You could write a simple script to write a FASTA file from your dataframe (though I'm sure there are packages for this too), which you can then feed into MMseqs2.
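As a rough illustration of such a script, here is a minimal sketch in Python. The function names `write_fasta` and `mmseqs_easy_cluster_cmd` are hypothetical, as are the column names `KEGG_ID`, `KO` and `sequence` in the comment (adjust them to your dataframe). The writer takes plain tuples so it streams one record at a time instead of building the whole file in memory. The command uses the `easy-cluster` workflow with `--min-seq-id 0.4` to mirror the 40% threshold from the question; treat that value and the choice of workflow as a starting point, not a tuned setup.

```python
from typing import Iterable, List, Tuple


def write_fasta(records: Iterable[Tuple[str, str, str]], path: str, width: int = 60) -> int:
    """Write (kegg_id, ko, sequence) records to a FASTA file; return the count written."""
    n = 0
    with open(path, "w") as handle:
        for kegg_id, ko, seq in records:
            # Header line: '>' + identifier; the KO follows a space as a description.
            handle.write(f">{kegg_id} {ko}\n")
            # Wrap the sequence at a fixed width (conventional, not required).
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
            n += 1
    return n


def mmseqs_easy_cluster_cmd(fasta: str, out_prefix: str, tmp_dir: str,
                            min_seq_id: float = 0.4) -> List[str]:
    """Build an argument list for `mmseqs easy-cluster` (assumes mmseqs is on PATH)."""
    return ["mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id)]


# With a dataframe `df` (column names are assumptions), something like:
#   rows = df[["KEGG_ID", "KO", "sequence"]].itertuples(index=False, name=None)
#   write_fasta(rows, "proteins.fasta")
#   subprocess.run(mmseqs_easy_cluster_cmd("proteins.fasta", "clusters", "tmp"), check=True)
```

MMseqs2 uses the text up to the first whitespace in the header as the sequence identifier, so putting the KEGG_ID first lets you join the cluster assignments back onto the dataframe afterwards.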

EDIT: If 15M is too slow, you can chunk your dataframe and run the Python script in parallel, or use packages specifically designed for speed. I'm only familiar with R, but the data.table library is exceptionally fast at reading and writing data, and you could adapt it for your purposes.
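The chunking idea could look something like the sketch below, in Python. The helper names `chunked` and `write_fasta_parts` and the part-file naming are hypothetical, and this version writes the chunks one after another. Each chunk is independent, so the per-chunk write is the piece you would hand to separate worker processes if a single pass turns out to be too slow.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def chunked(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_fasta_parts(records: Iterable[Record], prefix: str,
                      size: int = 1_000_000) -> List[str]:
    """Write one FASTA file per chunk and return the list of file paths."""
    paths = []
    for i, chunk in enumerate(chunked(records, size)):
        path = f"{prefix}.part{i:04d}.fasta"
        with open(path, "w") as handle:
            # Build the chunk's text in one go to limit the number of write calls.
            handle.write("".join(f">{name}\n{seq}\n" for name, seq in chunk))
        paths.append(path)
    return paths
```

With a dataframe, the records could come from something like `zip(df["KEGG_ID"], df["sequence"])` (column names assumed). FASTA files concatenate cleanly, so the parts can be joined into a single input file before clustering.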