goatools: how to write out (or draw) the entire GO hierarchy for all genes from one species (human)?
1
0
Entering edit mode
5.8 years ago
dp ▴ 50

I am trying to write out the entire GO hierarchy for humans using goatools. I don't have much experience with the package.

I know that read_ncbi_gene2go("assoc", taxids=[9606]) outputs a dictionary of human genes: GO id sets association with those genes but can't figure out how to organize and output that result in the GO hierarchy structure.

Thanks!

go goatools • 2.3k views
ADD COMMENT
0
Entering edit mode
5.7 years ago

Try this:

First, update your clone of GOATOOLS. We just published a research paper this week and there is much new code that accompanies the paper, some which affects the answer to your question:

Then, from the goatools repo directory, do this:

$ scripts/wr_hier.py BP MF CC --gene2go=gene2go --taxid=9606 --dash_len=17 --concise -o human_BP_MF_CC.txt

BP MF CC => Aliases for top GO IDs for biological_process(GO:0008150), molecular_function(GO:0003674) & cellular_process(GO:0005575)

--gene2go=gene2go --taxid=9606 => Tells wr_hier to read the associations in file, "gene2go" for human(9606)

--dash_len=17 => Adds spacing so the GO information printed next to each GO ID does not shift around. The BP branch is 17 GO IDs deep, so 17 is the right number for this print.

--concise => Because GO IDs can have multiple parents, the same sub-path can be printed multiple times, taking up gobs of space in a report and making the hierarchy report more difficult to read. To print a sub-path just once when it occurs more than one time, use this arg. If you use this arg, your report will be 37k lines. If you do not use this arg, your report will be 351,557 lines due to duplication.

Note

Drawing a big hierarchy will result in an unreadable plot and likely it won't complete, so doing a hierarchy report is preferable to a plot in this case.

Further Details

GO IDs that are in the association will be marked with a ">" at the beginning of the line. The numbers on the GO line include the total count of all descendants (e.g., 265) determined from the GO DAG and the information content score determined from the associations (e.g., 7.01)

  --- GO:0009628   BP 265  7.01 D02 response to abiotic stimulus
  ---- GO:0071214  BP  68  8.30 D04 cellular response to abiotic stimulus
> ----- GO:0071257 BP   0 11.48 D05 cellular response to electrical stimulus

The two ancestor GO IDs preceding the marked(>) GO:0071257 are not in the human association, but are in the hierarchy leading down to GO:0071257 , which is in the association.

--concise causes a line which uses '=' to indicate depth instead of '-':

> ======== GO:0071295  BP 13 10.97 L05 D07 cellular response to vitamin

The use of '=' instead of '-' to show a level of depth means that this GO AND its lower level terms were printed earlier and will not be duplicated to save space. In the case of GO:0071295, the first detailed print looks like this:

> -------- GO:0071295  BP 13 10.97 L05 D07 cellular response to vitamin
> --------- GO:0071307 BP  0 13.27 L06 D08 cellular response to vitamin K
> --------- GO:0071231 BP  0 13.27 L05 D08 cellular response to folic acid
> --------- GO:0071301 BP  0 13.96 L05 D08 cellular response to vitamin B1
> --------- GO:0071298 BP  0 13.96 L05 D08 cellular response to L-ascorbic acid
> --------- GO:0071305 BP  0 11.56 L05 D08 cellular response to vitamin D
> --------- GO:0071306 BP  0 13.27 L05 D08 cellular response to vitamin E
ADD COMMENT

Login before adding your answer.

Traffic: 2221 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6