I am trying to cluster genes in separate GO categories in the following way:
Input data (in a tab-delimited format):
id flybasename_gene flybase_gene_id entrezgene GOMF
1616608_a_at Gpdh FBgn0001128 33824 carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding
1622892_s_at CG33057 FBgn0053057 318833 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
1622892_s_at mkg-p FBgn0035889 38955 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
1622893_at IM3 FBgn0040736 50209 aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding
1622894_at CG15120 FBgn0034454 37248 protein binding
What I would like to get after processing the data is something like this (also tab-separated):
GO genes
protein binding FBgn0001128 FBgn0053057 FBgn0035889 FBgn0034454 ...
ammonium transmembrane transporter activity FBgn0053057 FBgn0035889 ...
hydrolase activity FBgn0040736 FBgn0001128 ...
Each row of the output file should contain all gene IDs from the input file that have the GO name of that row.
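To make the target concrete, here is a minimal sketch of the mapping described above, run on the five example rows. It is only an illustration under assumptions: the object names (`terms`, `long`, `go2genes`, `out`) are made up, the GOMF column is assumed to hold character strings rather than a factor, and this is not the recommended snippet quoted further down.

```r
# Toy data copied from the example rows (only the two columns that matter here).
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0053057", "FBgn0035889",
                      "FBgn0040736", "FBgn0034454"),
  GOMF = c(
    "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding",
    "nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity",
    "nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity",
    "aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding",
    "protein binding"),
  stringsAsFactors = FALSE
)

# Split each GOMF string into its individual terms (one character vector per row).
terms <- strsplit(dat$GOMF, ":", fixed = TRUE)

# Long format: one (gene, GO term) pair per row.
long <- data.frame(
  gene = rep(dat$flybase_gene_id, lengths(terms)),
  GO   = unlist(terms),
  stringsAsFactors = FALSE
)

# Group gene IDs by GO term; unique() guards against repeated pairs.
go2genes <- lapply(split(long$gene, long$GO), unique)

# One tab-separated line per GO term: the term, then its gene IDs.
out <- vapply(names(go2genes),
              function(g) paste(c(g, go2genes[[g]]), collapse = "\t"),
              character(1))
writeLines(out)
```

On these rows, `go2genes[["protein binding"]]` holds the four gene IDs listed in the example output line for "protein binding".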
I tried this a few months ago and it worked, but now I cannot figure out why it doesn't work the way I want it to.
This is what I do (this snippet was recommended by J. W. McDonald):
dat = read.delim("input_data.txt")  # "input_data.txt" is a placeholder for the tab-delimited input file
lst = tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
lst2 = lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
unlst = cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2, use.names = FALSE))
done = tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
But somehow I now lose all the GO categories. When I run the script, this is the output:
lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
lst
FBgn0001128 FBgn0034454 FBgn0035889 FBgn0040736 FBgn0053057
2 4 3 1 3
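One possible explanation, offered as a guess rather than a confirmed diagnosis: the integers printed above look like factor level codes. If GOMF is stored as a factor and every gene ID occurs in exactly one row, `tapply` simplifies the one-element results into a plain array, and the factor labels are dropped in favour of their integer codes. The sketch below reproduces that behaviour on shortened, made-up GOMF strings; the names `ids`, `gomf`, `as_factor` and `as_char` are illustrative only.

```r
# Five gene IDs, each appearing once, with shortened illustrative GOMF strings.
ids  <- c("FBgn0001128", "FBgn0053057", "FBgn0035889", "FBgn0040736", "FBgn0034454")
gomf <- c("carboxylesterase activity:protein binding",
          "nucleotide binding:protein binding",
          "nucleotide binding:protein binding",
          "aminopeptidase activity:hydrolase activity",
          "protein binding")
gomf_factor <- factor(gomf)  # levels are sorted alphabetically

# GOMF as a factor: each group returns a length-1 factor, and the
# simplified result keeps only the integer level codes.
as_factor <- tapply(seq_along(ids), ids, function(x) gomf_factor[x])

# GOMF as character: the simplified result keeps the strings.
as_char <- tapply(seq_along(ids), ids, function(x) gomf[x])

print(as_factor)
print(as_char)
```

If this is indeed the cause, converting the column with `as.character()` inside the function, or reading the file with `stringsAsFactors = FALSE`, may be enough to get the GO strings back.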
Thank you in advance for any ideas as to why this is not working the way I need it to.
Thanks
Assa