Hello,
I have a pandas dataframe containing 15 million rows and three columns, each row corresponding to a protein (its KEGG_ID, its KO and its sequence).
I want to cluster my proteins so that proteins in different clusters share less than 40% sequence similarity. I first thought of using the MMseqs2 clustering algorithm, but I realised it only accepts FASTA files as input and does not work on dataframes.
- Do you have any other ideas for how I could perform this clustering on my tabular data?
- Do I have to convert all my rows to FASTA format to perform the clustering (with 15M proteins, I worry this would be expensive in time and memory)?
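
For context, this is roughly the conversion I have in mind if FASTA turns out to be unavoidable. It is only a minimal sketch: the helper name `write_fasta`, the output file name, and the column names `KEGG_ID`, `KO` and `sequence` are assumptions for illustration. It writes one record at a time, so it should not need a second copy of the data in memory.

```python
from typing import Iterable, Tuple


def write_fasta(records: Iterable[Tuple[str, str, str]], path: str) -> int:
    """Stream (KEGG_ID, KO, sequence) tuples to a FASTA file.

    Returns the number of records written.
    """
    count = 0
    with open(path, "w") as handle:
        for kegg_id, ko, seq in records:
            # The KEGG_ID comes first so it acts as the sequence identifier;
            # the KO is kept after a space as a free-text description.
            handle.write(f">{kegg_id} {ko}\n{seq}\n")
            count += 1
    return count


# Hypothetical usage with the dataframe (column names are assumed):
# rows = df[["KEGG_ID", "KO", "sequence"]].itertuples(index=False, name=None)
# write_fasta(rows, "proteins.fasta")
```

If I read the MMseqs2 documentation correctly, the resulting file could then be clustered with something like `mmseqs easy-cluster proteins.fasta clusterRes tmp --min-seq-id 0.4`, but I have not tried this at the scale of 15M sequences.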
Thanks in advance for your help and advice.