Hi,
I have started working on a project that is a bit outside my usual area of expertise, so I am not yet aware of all the tools and algorithms. I thought I might find people with more experience here.
How would you rank a list of genomes according to the presence or absence of certain genes of interest? The goal is to curate a set of genomes (thousands) based on the presence and absence of some traits (e.g. presence of some bacteriocin and absence of an antibiotic resistance gene).
I don't want to filter genomes but rank them. It could also work with proteomes, and on PFAM domains instead of genes.
Is there a specific algorithm or software to do this job? I was thinking about counting and weighting COGs in a genome annotation file, or counting HMM hits on a proteome. I did some light Google searching and couldn't really find anything that suited me.
The first thing that springs to mind would be to rank the output of a pangenome/core genome tool like Roary. It will ultimately spit out a list of gene clusters, approximately of the form:
And it will do this for every gene cluster. Broadly speaking, the locus tags that appear most frequently would be highest up your ranking for presence (since each locus tag should correspond to a particular input genome).
It actually also outputs binary presence-absence alignments/trees which could be of use too.
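As a rough illustration of using that presence-absence output for ranking, here is a minimal Python sketch. It assumes a tab-delimited binary matrix shaped like Roary's gene_presence_absence.Rtab (first column is the gene cluster name, then one 0/1 column per genome). The function name, the cluster names, the genome names and the weights are all made up for the example; they are not part of Roary.

```python
import csv
import io


def rank_from_presence_absence(rtab_text, weights):
    """Rank genomes from a tab-delimited binary presence/absence matrix.

    rtab_text: matrix as text; header row is "Gene" followed by genome names.
    weights:   hypothetical mapping of gene cluster name -> weight
               (positive = wanted, negative = unwanted).
    """
    reader = csv.reader(io.StringIO(rtab_text), delimiter="\t")
    header = next(reader)
    genomes = header[1:]
    scores = {genome: 0.0 for genome in genomes}

    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, calls = row[0], row[1:]
        weight = weights.get(gene)
        if weight is None:
            continue  # cluster is not one we care about
        for genome, call in zip(genomes, calls):
            if call == "1":
                scores[genome] += weight

    # Highest score first; ties broken by genome name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy matrix with invented cluster and genome names.
example = (
    "Gene\tgenomeA\tgenomeB\tgenomeC\n"
    "bacteriocin_1\t1\t1\t0\n"
    "resistance_1\t0\t1\t1\n"
    "housekeeping_1\t1\t1\t1\n"
)
weights = {"bacteriocin_1": 2.0, "resistance_1": -3.0}
ranking = rank_from_presence_absence(example, weights)
for genome, score in ranking:
    print(genome, score)
```

With a real run you would read the file contents into rtab_text and use the cluster names that Roary assigned to your genes of interest.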
Hi Joe,
Yes, that's what I do for part of my project, but working on a limited number of genomes.
The downside of this method is that it's a bit "too smart": I need to load all the genomes, perform the pangenome analysis and, from there, analyse the loci. I was thinking of something much simpler that would focus only on my genes of interest/disinterest: use a weighting system, find the genes in a genome, give the genome a score, then sort the genomes by score. Eventually I would be able to add more genomes, score them, etc.
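To make the idea concrete, here is a minimal Python sketch of what I have in mind. It assumes the detection step (BLAST, HMM search, annotation lookup, whatever) has already produced, for each genome, a set of feature labels that were found. The class name, the feature labels and the weights are hypothetical, just for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeRanker:
    """Keeps a running score per genome so new genomes can be added later."""

    # Hypothetical weights: positive for wanted traits, negative for unwanted.
    weights: dict
    scores: dict = field(default_factory=dict)

    def add_genome(self, name, features):
        """Score one genome from the set of features detected in it."""
        present = set(features)
        score = sum(
            (w for feature, w in self.weights.items() if feature in present),
            0.0,
        )
        self.scores[name] = score
        return score

    def ranking(self):
        """Return (genome, score) pairs, best first, ties sorted by name."""
        return sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranker = GenomeRanker(weights={"bacteriocin": 1.0, "resistance_gene": -2.0})
ranker.add_genome("g1", {"bacteriocin"})
ranker.add_genome("g2", {"bacteriocin", "resistance_gene"})
ranker.add_genome("g3", set())
first_ranking = ranker.ranking()

# A genome scored later slots into the same ranking without rescoring the rest.
ranker.add_genome("g4", {"bacteriocin"})
second_ranking = ranker.ranking()
print(second_ranking)
```

Each genome is scored independently of the others, so nothing has to be recomputed when the set grows, unlike a pangenome run.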