I have InterProScan output for a new genome annotation, and I've used Blast2GO to look at the GO terms at different levels. I now want to produce a summary table of the number of proteins with each InterProScan family/domain.
Does anyone have a script or method to produce a summary like, for instance, Table 2 from this journal (I'll try to email the authors to see if they have a script, and will post it here if I get it):
I could just collect a subset of InterPro IDs, grep for them, and count the hits, but I'm wondering if there is a more comprehensive method that summarizes all proteins with family-level InterPro IDs?
I have downloaded the InterPro tree relationship file (example given below). Entries prefixed with -- are children of the preceding parent. For each parent (e.g. IPR015797), I want to count the proteins found for the parent including its children, and also count each child separately.
IPR015797::NUDIX hydrolase domain-like::
--IPR000086::NUDIX hydrolase domain::
IPR015812::Integrin beta subunit::
--IPR012013::Integrin beta-4 subunit::
--IPR015436::Integrin beta-6 subunit::
--IPR015437::Integrin beta-7 subunit::
--IPR015439::Integrin beta-2 subunit::
--IPR015442::Integrin beta-8 subunit::
--IPR027067::Integrin beta-5 subunit::
--IPR027068::Integrin beta-3 subunit::
--IPR027070::Integrin beta-like protein 1::
--IPR027071::Integrin beta-1 subunit::
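
One possible way to do the roll-up described above, as a rough sketch rather than a tested pipeline. It assumes the InterProScan results are in the tab-separated format with the protein ID in column 1 and the InterPro accession in column 12 (that column is normally only filled in when the run included InterPro lookup, e.g. `-iprlookup`), and that deeper levels in the tree file are marked with additional `--` pairs. The function names and the toy rows at the bottom are made up for illustration.

```python
from collections import defaultdict


def parse_tree(lines):
    """Return {ipr_id: (name, parent_id or None)} from tree-file lines.

    Assumes each nesting level is marked by one extra leading '--'.
    """
    entries = {}
    stack = []  # stack[d] holds the most recent entry seen at depth d
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stripped = line.lstrip("-")
        depth = (len(line) - len(stripped)) // 2
        fields = stripped.split("::")
        ipr_id = fields[0]
        name = fields[1] if len(fields) > 1 else ""
        parent = stack[depth - 1] if 0 < depth <= len(stack) else None
        del stack[depth:]
        stack.append(ipr_id)
        entries[ipr_id] = (name, parent)
    return entries


def proteins_per_entry(tsv_lines, ipr_column=11):
    """Map each InterPro accession to the set of protein IDs that hit it.

    ipr_column=11 (0-based) is an assumption about the TSV layout.
    """
    hits = defaultdict(set)
    for line in tsv_lines:
        row = line.rstrip("\n").split("\t")
        if len(row) > ipr_column and row[ipr_column].startswith("IPR"):
            hits[row[ipr_column]].add(row[0])
    return hits


def summarize(entries, hits):
    """Return rows of (ipr_id, name, parent, n_direct, n_with_descendants)."""
    rolled = {ipr: set(hits.get(ipr, ())) for ipr in entries}
    for ipr in entries:
        ancestor = entries[ipr][1]
        while ancestor is not None:
            # Add this entry's proteins to every ancestor's rolled-up set.
            rolled[ancestor] |= hits.get(ipr, set())
            ancestor = entries[ancestor][1]
    return [
        (ipr, name, parent, len(hits.get(ipr, ())), len(rolled[ipr]))
        for ipr, (name, parent) in entries.items()
    ]


def toy_row(protein, ipr):
    """Build a fake 13-column TSV row (illustrative data only)."""
    return "\t".join([protein, "md5", "100", "Pfam", "PF00000", "desc",
                      "1", "50", "1e-5", "T", "01-01-2020", ipr, "ipr desc"])


tree_lines = [
    "IPR015797::NUDIX hydrolase domain-like::",
    "--IPR000086::NUDIX hydrolase domain::",
    "IPR015812::Integrin beta subunit::",
    "--IPR012013::Integrin beta-4 subunit::",
]
tsv_lines = [toy_row("p1", "IPR000086"), toy_row("p1", "IPR015797"),
             toy_row("p2", "IPR015797"), toy_row("p3", "IPR012013")]

for row in summarize(parse_tree(tree_lines), proteins_per_entry(tsv_lines)):
    print("\t".join(str(x) for x in row))
```

With real data, the two toy lists would be replaced by `open("ParentChildTreeFile.txt")` and `open("your_results.tsv")` (both file names are placeholders). Proteins are collected as sets, so a protein that hits both a parent and one of its children is counted once in the parent's total rather than twice.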