Question

How to cluster genes based on annotated function?

3.7 years ago

limchen • 0

I ran a pangenome analysis on around 300 bacterial genomes and now have an output table called gene_presence and absence.tsv, which contains more than 50,000 gene families. Can anyone recommend tools to cluster those gene families? As it stands, the gene_presence_absence table is too large to analyze directly. I want to cluster the genes by their annotated function, hoping this will simplify the table, and I also want to see whether these functional groups correlate with the strain groups defined by their ANI values.

I know RNA-seq people can run GO analysis, but mine is pangenome data. Any suggestions are welcome.

Best, LC.

genome sequencing • 624 views

updated 3.7 years ago by Mensur Dlakic ★ 27k • written 3.7 years ago by limchen • 0

score 0 · Answer 1 · 2020-08-23

A common way of clustering proteins is by sequence similarity. There are several ways to do that, but you may want to try BLAST comparisons: take all the proteins, compare them in all-vs-all fashion, and keep the pairs with significant E-values. There is a package called MCL that works well with BLAST results and will cluster large datasets efficiently. You may also want to try OrthoMCL.
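As a rough illustration of the "extract significant E-values" step, here is a minimal Python sketch that turns all-vs-all BLAST hits into weighted edges in the three-column label format ("abc") that MCL can read. It assumes the BLAST results are in the standard 12-column tabular layout (`-outfmt 6`), where the E-value is the 11th column. The function name `blast_to_abc`, the `1e-5` cutoff, and the `-log10(E)` weight capped at 200 are illustrative choices, not something prescribed in this thread.

```python
import math


def blast_to_abc(blast_lines, max_evalue=1e-5):
    """Convert BLAST tabular (outfmt 6) hits into MCL 'abc' edges.

    Returns a sorted list of (gene_a, gene_b, weight) tuples, one per
    unordered gene pair, keeping the strongest hit for each pair.
    """
    best = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # drop self-hits and non-significant hits
        # Assumed weighting: -log10(E), capped at 200 (BLAST reports 0.0
        # for extremely small E-values, which has no finite logarithm).
        weight = 200.0 if evalue == 0.0 else min(200.0, -math.log10(evalue))
        pair = tuple(sorted((query, subject)))
        # A-vs-B and B-vs-A collapse to one edge; keep the larger weight.
        # Non-positive weights (E >= 1) are never stored.
        if weight > best.get(pair, 0.0):
            best[pair] = weight
    return [(a, b, w) for (a, b), w in sorted(best.items())]


if __name__ == "__main__":
    # Tiny made-up example: four hits among three hypothetical genes.
    demo = [
        "geneA\tgeneA\t100\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
        "geneA\tgeneB\t90\t300\t30\t0\t1\t300\t1\t300\t1e-50\t400",
        "geneB\tgeneA\t90\t300\t30\t0\t1\t300\t1\t300\t1e-48\t390",
        "geneA\tgeneC\t30\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
    ]
    for a, b, w in blast_to_abc(demo):
        print(f"{a}\t{b}\t{w:.1f}")
```

If the printed lines are saved to a file (say `edges.abc`, a hypothetical name), MCL can then be run on it in label mode, for example `mcl edges.abc --abc -I 2.0 -o clusters.txt`. The inflation value controls cluster granularity and is worth tuning for your data; 2.0 here is only a starting point.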