Question

Clustering Genes In Go Categories

1

Entering edit mode

11.7 years ago

Assa Yeroslaviz ★ 1.8k

I am trying to cluster genes in separate GO categories in the following way:

input data (in a tab-delimited format):

id    flybasename_gene    flybase_gene_id    entrezgene    GOMF               
1616608_a_at    Gpdh    FBgn0001128    33824    carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding
1622892_s_at    CG33057    FBgn0053057    318833    nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
1622892_s_at    mkg-p    FBgn0035889    38955    nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
1622893_at    IM3    FBgn0040736    50209    aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding
1622894_at    CG15120    FBgn0034454    37248    protein binding

what I would like to get after processing the data is something like that (also tab-separated):

GO    genes
protein binding     FBgn0001128    FBgn0053057     FBgn0035889     FBgn0034454 ...
ammonium transmembrane transporter activity      FBgn0053057    FBgn0035889 ...
hydrolayse activity   FBgn0040736     FBgn0001128 ...

each row of the output file should contains all gene IDs from the input file, that have the GO name of the row.

I tried it a few months ago and it worked, but now somehow I cannot figure out why it doesn't wrok the way i want it.

This is what I do (this snippet was recommended by J. W. McDonald):

dat = read.delim(input data)
lst = tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
lst2 = lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
unlst = cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2, use.names = FALSE))
done = tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])

But somehow now I loose all the GO categories. When running the script, this is the output:

lst <- tapply(1:nrow(dat), dat$flybasegeneid, function(x) dat[x,"GOMF"])

lst

 FBgn0001128 FBgn0034454 FBgn0035889 FBgn0040736 FBgn0053057 
      2           4           3           1           3

I would like to thank in advance for any ideas as to why this is not working the way I need it

Thanks

Assa

go clustering annotation • 2.2k views

ADD COMMENT • link updated 11.7 years ago by JC 13k • written 11.7 years ago by Assa Yeroslaviz ★ 1.8k

score 4 · Answer 1 · 2012-08-28

I'm more proficient in Perl, so here is my solution:

#!/usr/bin/perl

use strict;
use warnings;

my %data;
while (<>) {
    next if (m/^id/); # skip header
    chomp;
    my ($id, $gen, $gid, $ent, $go) = split(/\t/, $_);
    my @go = split(/:/, $go);
    foreach $go (@go) {
        $data{$go}{$gid} = 1;
    }
}

print "GO\tgenes\n";
foreach my $go (sort keys %data) {
    my @gid = sort keys %{ $data{$go} };
    print join "\t", $go, @gid;
    print "\n";
}

Using your example:

$ perl parser.pl < data.txt
GO    genes
3',5'-cyclic-nucleotide phosphodiesterase activity    FBgn0001128
ATP binding    FBgn0035889    FBgn0053057
aminopeptidase activity    FBgn0040736
ammonium transmembrane transporter activity    FBgn0035889    FBgn0053057
carboxylesterase activity    FBgn0001128
chaperone binding    FBgn0035889    FBgn0053057
hydrolase activity    FBgn0001128    FBgn0040736
manganese ion binding    FBgn0040736
metalloexopeptidase activity    FBgn0040736
nucleotide binding    FBgn0035889    FBgn0053057
protein binding    FBgn0001128    FBgn0034454    FBgn0035889    FBgn0053057