Question

How Does One Aggregate Figfam Ids By Subsystem?

4

14.0 years ago

Behindtherabbit ▴ 60

I have a long list of genes with corresponding FigFam IDs (e.g., FIG000176 | CTP synthase (EC 6.3.4.2)). I'd like to aggregate those genes by subsystem; in other words, I need a way to find the subsystem(s) for a given FigFam ID. I've searched the NMPDR site, but to no avail. Does anyone know if/how this can be done?

identifiers annotation • 5.9k views

ADD COMMENT • link updated 12.7 years ago by Katrine ▴ 20 • written 14.0 years ago by Behindtherabbit ▴ 60

score 5 · Answer 1 · 2010-05-05

Some years ago, we had a visit from Ross Overbeek, one of the masterminds behind the FIG projects. They showed us how to use the SEED to annotate their subsystems. (The user interface of that software is really confusing and chaotic, even more so than the FigFam site; that's what they said themselves. But anyway...)

If you go to the SEED, you can search by name (e.g., CTP synthase) under "Searching for Genes or Functional Roles Using Text". You cannot search for IDs in the FIG000232 format, though; at least I didn't find a way to do so in either FIG or the SEED, or they did a good job of hiding that functionality.

You can also use "Work on Subsystems" in the SEED search form and enter an arbitrary user name. You then get a list of subsystems; if you click on one, you get a subsystem spreadsheet with genes and functional roles.

The identifiers you can search for are those of PEGs (protein-encoding genes, or so). They look different from yours (FIG000171), for example: fig|306254.1.peg.102. If you have identifiers for PEGs, the search on the FigFam search page should also work. Since you don't, searching by name is the better option.

Another option would be to download the whole set of FIGfams from ftp://ftp.nmpdr.org/FIGfams/ (>10 GB) and search through it, or to write to Ross Overbeek or other researchers in that group (I think one name was "von Stein" or something). BTW: the project has no funding anymore, so any of their services might be discontinued.

score 2 · Answer 2 · 2010-06-24

2

13.8 years ago

Phil Goetz ▴ 150

You need to download the figfams file. Then look in the file families.2c. It maps, e.g., fig|110662.3.peg.1058 to FIG000001. Then you get the fun task of mapping fig|110662.3.peg.1058 to an accession. You do that using the file relevant.peg.data. There are lots of different ways of expressing accessions in that file: accessions from the same data source (e.g., Swiss-Prot, GenBank) are expressed in multiple ways, and some fig accessions are mapped only via the md5 checksum computed on the protein sequence. I have code to do this imperfectly.
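The first step above (reading families.2c into a PEG-to-FIGfam lookup) could be sketched roughly as follows. This is only an illustration, not Phil's code: it assumes the file is tab-delimited with two columns, and because the column order is not stated in this thread it accepts either order by checking which field starts with "fig|". The function name and the second sample line (peg.1059) are made up for the example.

```python
import io

def read_peg_to_figfam(handle):
    """Build a dict mapping PEG ids (fig|...) to FIGfam ids (FIG...).

    Assumption: two tab-separated columns per line; the column order
    is detected from the "fig|" prefix rather than assumed.
    """
    peg_to_fam = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        first, second = fields[0], fields[1]
        if first.startswith("fig|"):
            peg, fam = first, second
        elif second.startswith("fig|"):
            peg, fam = second, first
        else:
            continue  # neither column looks like a PEG id
        peg_to_fam[peg] = fam
    return peg_to_fam

# Tiny in-memory stand-in for families.2c (second line is invented).
sample = io.StringIO(
    "FIG000001\tfig|110662.3.peg.1058\n"
    "fig|110662.3.peg.1059\tFIG000001\n"
)
peg_to_fam = read_peg_to_figfam(sample)
print(peg_to_fam["fig|110662.3.peg.1058"])
```

With the real file you would pass `open("families.2c")` instead of the in-memory sample. The harder step, mapping PEG ids to external accessions via relevant.peg.data, is not covered here.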

Phil Goetz, JCVI

ADD COMMENT • link 13.8 years ago by Phil Goetz ▴ 150

0

Phil and Michael,

thanks a lot for your responses. I guess I'll look into parsing the figfams file.

cheers,

tim

ADD REPLY • link 13.8 years ago by Behindtherabbit ▴ 60

0

Hi Phil!

Could you please send me your code for this mapping task?

Best,

Jhordan Alarcón

ADD REPLY • link 5.1 years ago by jhordan.rav • 0

Ram · Answer 3 · 2010-07-05

If you download the genome directory from RAST (on the "Job Details" page, select "Genome Directory" as the download format), inside the Subsystems folder you will find a file called bindings, which contains the information needed to correlate the FigFam ID with the subsystem and functional role. You can later correlate the rows from this file with the FigFam ID in the FIG000176 format you describe (see below). Here is the top line of my genome's bindings file:

TCA_Cycle Citrate synthase (si) (EC 2.3.3.1) fig|666666.4954.peg.421

Gordon Pusch at RAST has been amazingly helpful and detailed when I ask questions; here is my exchange with him asking him to describe the subsystem/bindings file:

I have discovered that assigning the FIG numbers to a subsystem is not trivial (i.e. FIG133002). Perhaps I should be using the information in the Subsystems>bindings file? Can I ask you to tell me what is contained in this file?

It has 1443 lines (my genome has 1664 predicted CDS and 984 are included in subsystems), and the first column is a description of a kind of gene category, I think (e.g., TCA cycle, glutaredoxins...). Many pegs seem to be repeated a few times (between 2 and 6, from a quick grepping around).

Do you know why there are repeats?

Is there a list somewhere of how the descriptors in the first column of the bindings file fit into the subsystems, without having to use the RAST website and click for every category?

Gordon's answer: RE: the 'Subsystems/bindings' file:

1.) The first column is the subsystem name;

2.) The second column is the PEG's functional role within the subsystem;

3.) The third column is the FIG identifier of the PEG.

The "repeats" occur because subsystems are allowed to "overlap," i.e., a given PEG may participate in more than one subsystem.
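Given that three-column layout, a minimal sketch for reading the bindings file while keeping the overlaps explicit might look like this. It assumes the columns are tab-separated (the functional role itself contains spaces); the function name is made up, and the second sample line, including its subsystem and role names, is invented purely to show a PEG appearing in two subsystems.

```python
import io
from collections import defaultdict

def read_bindings(handle):
    """Map each PEG id to the set of (subsystem, role) pairs it appears in.

    Assumption: three tab-separated columns: subsystem, role, PEG id.
    """
    peg_to_subsystems = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        subsystem, role, peg = fields
        peg_to_subsystems[peg].add((subsystem, role))
    return peg_to_subsystems

# First line is the real excerpt; the second is a made-up overlap.
sample = io.StringIO(
    "TCA_Cycle\tCitrate synthase (si) (EC 2.3.3.1)\tfig|666666.4954.peg.421\n"
    "Placeholder_Subsystem\tPlaceholder role\tfig|666666.4954.peg.421\n"
)
peg_to_subsystems = read_bindings(sample)
for peg, pairs in peg_to_subsystems.items():
    print(peg, sorted(subsystem for subsystem, _ in pairs))
```

Storing a set per PEG means a "repeated" PEG simply shows up once with several subsystems attached, which makes the overlaps easy to count or filter later.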

The complete table of subsystems and the functional roles within them may be downloaded from <ftp://ftp.nmpdr.org/subsystems/subsys.txt>. This table contains additional columns grouping the subsystems into categories and subcategories.

Another useful piece of info from Gordon:

I have a (somewhat outdated) webpage describing the contents of a SEED format genome directory as of two years ago:

<http://microbe.cs.niu.edu/biodocs/Class_Notes/SEED_overview.html>;

skip down to section "Structure of a Genome Directory."

You can link the FigfamID with the fig|##.peg.# kind of ID, using a file in the Genome Directory called "found" for the genes that have been included in subsystems. There may be other files with the same info, but here is an excerpt from Subsystems>found:

fig|666666.4954.peg.4 FIG133002 Pyridoxamine 5'-phosphate oxidase (EC 1.4.3.5)

fig|666666.4954.peg.7 FIG000635 MG(2+) CHELATASE FAMILY PROTEIN / ComM-related protein

and "proposed_non_ff_functions" for the genes that were not included in subsystems:

fig|666666.4954.peg.1 Autotransporter adhesin

fig|666666.4954.peg.2 hypothetical protein

fig|666666.4954.peg.3 hypothetical protein

fig|666666.4954.peg.5 putative monooxygenase component
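Putting the two files together, one way to get from a FigFam ID to its subsystem(s) is to join "found" and "bindings" on the PEG id. The sketch below assumes both files are tab-delimited, with "found" holding PEG id, FigFam ID, and function in that order (as in the excerpt above). The function names are hypothetical, and the bindings rows in the sample use placeholder subsystem and role names, since I only have the first line of the real file to show.

```python
import io
from collections import defaultdict

def read_found(handle):
    """Map PEG id -> FigFam ID from a 'found'-style file (assumed tab-delimited)."""
    peg_to_fam = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].startswith("FIG"):
            peg_to_fam[fields[0]] = fields[1]
    return peg_to_fam

def figfam_to_subsystems(found_handle, bindings_handle):
    """Join 'found' and 'bindings' on PEG id; return FigFam ID -> set of subsystems."""
    peg_to_fam = read_found(found_handle)
    fam_to_subsystems = defaultdict(set)
    for line in bindings_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        subsystem, _role, peg = fields
        fam = peg_to_fam.get(peg)
        if fam is not None:  # PEGs without a FigFam are ignored
            fam_to_subsystems[fam].add(subsystem)
    return fam_to_subsystems

# 'found' rows are from the excerpt; 'bindings' rows are placeholders.
found = io.StringIO(
    "fig|666666.4954.peg.4\tFIG133002\tPyridoxamine 5'-phosphate oxidase (EC 1.4.3.5)\n"
    "fig|666666.4954.peg.7\tFIG000635\tMG(2+) CHELATASE FAMILY PROTEIN / ComM-related protein\n"
)
bindings = io.StringIO(
    "Placeholder_Subsystem_A\tPlaceholder role\tfig|666666.4954.peg.4\n"
    "Placeholder_Subsystem_B\tPlaceholder role\tfig|666666.4954.peg.4\n"
    "Placeholder_Subsystem_A\tPlaceholder role\tfig|666666.4954.peg.7\n"
)
fam_to_subsystems = figfam_to_subsystems(found, bindings)
for fam in sorted(fam_to_subsystems):
    print(fam, sorted(fam_to_subsystems[fam]))
```

Because the result is a set per FigFam ID, a family whose PEG sits in several overlapping subsystems is listed under each of them; how to weight those overlaps when comparing genomes is left open here.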

I am still a bit stuck on how to handle the fact that there are repeats from the overlapping presence of predicted genes in more than one subsystem, but one good thing if you are comparing multiple genomes annotated in the same way is that they will hopefully have similar overlaps. This may be how I justify comparisons of genomes annotated this way...

Did the original person who asked this question (behind the rabbit) find a solution? Any other ideas for how to handle this?

Katrine Whiteson, University of Geneva Hospitals, Genomic Research Lab