Question

How to cluster .faa files?

0

4 months ago

bioinformagician • 0

Hi!

I have over 100 .faa files representing different bacteria. I want to cluster the bacteria according to their proteins. What sort of distance matrix can I generate between the .faa files?

Any suggestion for this approach?

Thank you.

faa protein clustering amino-acids • 512 views

ADD COMMENT • link updated 4 months ago by Joe 21k • written 4 months ago by bioinformagician • 0

1

I agree with Joe, it's hard to tell what you want. But here are some general suggestions:

OrthoFinder groups proteins into orthogroups, and can infer phylogenies from protein similarities as well as generating orthogroup trees.
MMseqs2 is the current gold standard in clustering sequences.
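
As a rough sketch of how the output of a sequence clusterer could feed a genome-level comparison: this assumes a two-column, tab-separated table of representative ID and member ID (the layout MMseqs2 writes for its cluster TSV). It also assumes every protein ID was prefixed with its genome name and a `|` before clustering. That prefix is a hypothetical naming scheme used only for illustration, and the function name is made up as well.

```python
import csv
import io
from collections import defaultdict


def clusters_per_genome(tsv_handle, sep="|"):
    """Map genome name -> set of cluster representatives seen in that genome.

    Assumes rows of (representative_id, member_id) and that member IDs
    look like '<genome><sep><protein>' (an assumption, not a tool default).
    """
    genome_clusters = defaultdict(set)
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        genome = member.split(sep, 1)[0]
        genome_clusters[genome].add(representative)
    return dict(genome_clusters)


# Toy input standing in for a real cluster table.
example_tsv = io.StringIO(
    "gA|p1\tgA|p1\n"
    "gA|p1\tgB|p7\n"
    "gB|p2\tgB|p2\n"
)
presence = clusters_per_genome(example_tsv)
```

Here `presence` records which protein clusters each genome contains, which is one possible starting point for a distance between genomes.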

ADD REPLY • link 4 months ago by dthorbur ★ 1.9k

0

Thank you @Joe and dthorbur. I had one FASTA file per bacterium, then I ran Prodigal to extract only the coding sequences (CDS) of each bacterium, obtaining the .faa files. Now I want to cluster the .faa files, obtain N clusters and then extract a sequence that identifies each cluster.

My goal is not to obtain phylogeny but rather to obtain a sequence that generally represents the cluster. Therefore the first step would be to identify the clusters based on similarity between sequences of each .faa.

I already thought of a way to do this, which would be:

Identify the proteins that each bacterium has and annotate them
Create a binary matrix with .faa files as rows and proteins as columns
For each file, mark 1 if the .faa contains the protein and 0 if it does not
Apply hclust with Ward's method and group the bacteria according to the proteins they contain
From the proteins, obtain a general sequence that represents the cluster

However, this would be computationally difficult given the number of possible proteins: the matrix would have a very large number of columns. So I would like to cluster the sequences based on similarity distances, create clusters, and then get that general sequence representing each cluster.
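
For what it's worth, the presence/absence idea can be sketched without building a wide 0/1 matrix, by keeping one set of protein families per genome and computing a pairwise Jaccard distance. The family labels and genome names below are invented toy data, and Jaccard is just one reasonable choice of distance here, not the only one.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(presence):
    """Build a symmetric genome-by-genome distance matrix.

    `presence` maps genome name -> set of protein family labels.
    Returns the sorted genome names and a list-of-lists matrix.
    """
    names = sorted(presence)
    matrix = [
        [jaccard_distance(presence[row], presence[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy data: three genomes and the protein families found in each.
toy_presence = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam2", "fam3", "fam4"},
    "genome_C": {"fam5"},
}
names, matrix = distance_matrix(toy_presence)
```

A matrix like this could then be handed to a hierarchical clustering routine. Memory for the sets grows with the proteins actually present in each genome rather than with the total number of possible columns.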

ADD REPLY • link 4 months ago by bioinformagician • 0

0

I think you're trying to reinvent the wheel a little bit. dthorbur is on the right track - what you're essentially describing is something like wgMLST.

I would start with a tool like Roary, which will cluster proteins as part of its process. From those clusters you can later decide how you want to pick a representative example.
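
One simple way to pick a representative, sketched here under assumptions: take the longest member of each cluster. This rule is only an illustration, not something Roary does for you, and the cluster dictionary and function names are hypothetical. A consensus built from an alignment of the cluster members would be another option.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    The identifier is the first whitespace-delimited token of the header.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_member(clusters, sequences):
    """For each cluster, return the member ID with the longest sequence.

    Ties go to the alphabetically first ID, because max() keeps the
    first maximal element of the sorted member list.
    """
    return {
        cluster_id: max(sorted(members), key=lambda m: len(sequences[m]))
        for cluster_id, members in clusters.items()
    }


# Toy protein FASTA and toy clusters (invented for illustration).
fasta_text = """>p1 some description
MKT
AIL
>p2
MK
>p3
MKTAILV
"""
sequences = dict(read_fasta(fasta_text.splitlines()))
clusters = {"c1": ["p1", "p2"], "c2": ["p3"]}
representatives = longest_member(clusters, sequences)
```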

Roary will ingest annotated genomes in GFF format, so the first step will actually be to start over and generate new input files. Prodigal is good at what it does, but it's a little bit crude to treat the output as the total protein content of the bacteria. Much better to use a proper annotation pipeline like Prokka.

ADD REPLY • link 4 months ago by Joe 21k

0

I think you need to clarify the question a bit.

Is each .faa the proteome of a single bacterium?

What are you aiming to cluster by? Average sequence identity?

Tools like CD-HIT exist specifically for this, but they don't infer any phylogeny etc.

ADD REPLY • link 4 months ago by Joe 21k