Question

Ranking bacterial genomes using present and absent genes information

5.4 years ago

Elendol • 0

Hi,

I have started working on a project a bit outside my usual area of expertise, so I am not yet aware of all the tools and algorithms, and I thought I might find people with more experience here.

How would you rank a list of genomes according to the presence or absence of certain genes of interest? The goal is to curate a set of genomes (thousands) based on the presence and absence of some traits (e.g. presence of some bacteriocins and absence of antibiotic resistance genes).

I don't want to filter genomes but rank them. It could also work with proteomes, and with Pfam domains instead of genes.

Is there a specific algorithm or software to do this job? I was thinking about counting and weighting COGs in a genome annotation file, or counting HMM hits on a proteome. I did some light Google searching and couldn't really find anything that suited me.
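
To make the HMM-counting idea concrete, here is a minimal sketch. It assumes the proteome was searched with `hmmsearch --tblout`, so that in each non-comment line the third whitespace-separated field is the HMM (query) name and the fifth is the full-sequence E-value. The function name `count_hmm_hits`, the E-value cutoff, and the HMM and protein names in the example are all made up for illustration.

```python
from collections import Counter


def count_hmm_hits(tblout_text, max_evalue=1e-5):
    """Count significant hits per HMM in hmmsearch --tblout style text.

    Assumes hmmsearch column order: target name, accession,
    query (HMM) name, accession, full-sequence E-value, ...
    """
    counts = Counter()
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hmm_name = fields[2]
        evalue = float(fields[4])
        if evalue <= max_evalue:
            counts[hmm_name] += 1
    return counts


# Hypothetical example data, not real search output.
example = """\
# target name  accession  query name  accession  E-value  score  bias
prot_001  -  bacteriocin_hmm  -  1.2e-20  70.1  0.3
prot_007  -  bacteriocin_hmm  -  3.0e-08  30.2  0.1
prot_042  -  resistance_hmm   -  0.5      5.0   0.0
"""

hits = count_hmm_hits(example)
print(hits)
```

The per-HMM counts for each proteome could then be multiplied by weights to give a genome score.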

genome bacteria • 805 views

ADD COMMENT • link 5.4 years ago by Elendol • 0

The first thing that springs to mind would be to rank the output of a pangenome/core genome tool like Roary.

It will ultimately spit out a list of gene clusters, approximately of the form:

 LocusTagX, LocusTagY, LocusTagZ...

And it will do this for every gene cluster. Broadly speaking, the genomes whose locus tags appear most frequently would be highest up your ranking for presence (since each locus tag should correspond to a particular input genome).

It actually also outputs binary presence-absence alignments/trees which could be of use too.
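
As a rough sketch of ranking from a binary presence/absence table: the code below assumes a tab-separated table with a `Gene` column followed by one 0/1 column per genome (the layout I would expect from a file like Roary's `gene_presence_absence.Rtab`, but check your own output). The function name `rank_by_presence` and the gene and genome names are hypothetical.

```python
import csv
import io


def rank_by_presence(rtab_text, genes_of_interest):
    """Rank genomes by how many genes of interest they carry.

    Assumes a tab-separated table: first column 'Gene',
    then one 0/1 column per genome.
    """
    reader = csv.DictReader(io.StringIO(rtab_text), delimiter="\t")
    genomes = [name for name in reader.fieldnames if name != "Gene"]
    counts = {genome: 0 for genome in genomes}
    for row in reader:
        if row["Gene"] in genes_of_interest:
            for genome in genomes:
                counts[genome] += int(row[genome])
    # Highest count first; ties keep their column order.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example table.
example = (
    "Gene\tgenomeA\tgenomeB\tgenomeC\n"
    "bacA\t1\t0\t1\n"
    "bacB\t1\t1\t0\n"
    "other\t0\t1\t1\n"
)

ranking = rank_by_presence(example, {"bacA", "bacB"})
print(ranking)
```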

ADD REPLY • link 5.4 years ago by Joe 22k

Hi Joe,

Yes, that's what I do for part of my project, but on a limited number of genomes.

The downside of this method is that it's a bit "too smart": I need to load all the genomes, perform the pangenome analysis, and from there analyse the loci. I was doing something much simpler that would focus only on my genes of interest/disinterest: use a weighting system, find the genes in a genome, give the genome a score, then sort the genomes by score. Later I would be able to add more genomes, score them, etc.
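
A minimal sketch of that simpler scoring idea, assuming each genome has already been reduced to the set of genes of interest found in it (by whatever search method). The weights, gene names, and genome names are hypothetical: positive weights reward wanted genes, negative weights penalise unwanted ones.

```python
def score_genome(genes_present, weights):
    """Sum the weights of all weighted genes found in one genome."""
    return sum(w for gene, w in weights.items() if gene in genes_present)


def rank_genomes(genomes, weights):
    """Return (genome, score) pairs sorted from best to worst score."""
    scores = {name: score_genome(genes, weights)
              for name, genes in genomes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: reward bacteriocins, penalise a resistance gene.
weights = {"bacteriocin_1": 2.0, "bacteriocin_2": 1.0, "amr_1": -3.0}

# Hypothetical genomes, each given as the set of genes detected in it.
genomes = {
    "genomeA": {"bacteriocin_1", "bacteriocin_2"},
    "genomeB": {"bacteriocin_1", "amr_1"},
    "genomeC": {"bacteriocin_2"},
}

# A new genome can be added later and the ranking recomputed.
genomes["genomeD"] = {"bacteriocin_1"}

ranking = rank_genomes(genomes, weights)
for name, score in ranking:
    print(name, score)
```

Because each genome is scored independently of the others, adding genomes never changes existing scores, only the sort order.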

ADD REPLY • link 5.4 years ago by Elendol • 0