Semicolon vs vertical line separating GO annotations in .go file
4.6 years ago
CephBirk ▴ 20

I have a .go file from the supplementary material of a paper and am trying to understand the formatting. Here's a sample:

Ocbimv22017828m GO:chromatin binding ; GO:0003682;GO:protein binding ; GO:0005515
Ocbimv22003392m GO:protein binding ; GO:0005515
Ocbimv22036412m GO:DNA binding ; GO:0003677|GO:DNA-directed RNA polymerase activity ; GO:0003899|GO:transcription, DNA-templated ; GO:0006351
Ocbimv22003166m GO:scavenger receptor activity ; GO:0005044|GO:membrane ; GO:0016020
Ocbimv22034134m GO:protein binding ; GO:0005515|GO:zinc ion binding ; GO:0008270
Ocbimv22036284m GO:sequence-specific DNA binding transcription factor activity ; GO:0003700|GO:regulation of transcription, DNA-dependent ; GO:0006355|GO:nucleus ; GO:0005634
Ocbimv22004380m GO:transmembrane transporter activity ; GO:0022857|GO:transmembrane transport ; GO:0055085|GO:integral to membrane ; GO:0016021


The Ocbimv22... entries are the transcript IDs, and each has corresponding GO terms. However, the GO terms are sometimes separated by semicolons and sometimes by vertical lines. Is this a standard file format (I'm new to this field)? I've tried contacting the corresponding author but have not heard back. Does a semicolon mean something different from a vertical line, or is it safe to assume they're synonymous?

go formatting gene ontology • 996 views

It doesn't seem to be a standard format; in fact, it looks kind of messy. Sometimes a semicolon separates a GO name from its accession number, as in

GO:protein binding ; GO:0005515

Other times, semicolons are separating pairs of accessions / names, like in

GO:chromatin binding ; GO:0003682;GO:protein binding ; GO:0005515

I think it just means the formatting is wrong, with semicolons meant to separate a GO accession from its name, and pipes (the vertical bars) meant to separate different pairs of accessions / names, but sometimes semicolons were erroneously used.
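One way to check whether the semicolon-as-pair-separator really is an occasional slip is to count how many lines use each separator. Below is a minimal sketch, assuming the layout is as in the sample: a pair separator is immediately followed by "GO:", while the semicolon inside a pair is followed by a space. The function name count_separators and the file name annotations.go are made up for illustration.

```shell
# count_separators FILE: count the lines that use each kind of pair separator.
# Assumes a pair separator is directly followed by "GO:" (no space), as in the sample.
count_separators() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    tight_semicolon=$(grep -c ';GO:' "$1" || true)
    pipe=$(grep -c '|GO:' "$1" || true)
    printf 'tight-semicolon lines: %s\n' "$tight_semicolon"
    printf 'pipe lines: %s\n' "$pipe"
}
```

Running count_separators annotations.go on the real file would show whether the tight semicolon appears on only a handful of lines (which would support the "formatting mistake" reading) or on a large fraction of them.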

It would be helpful if you included a link to the paper. Did you read its materials and methods? What did it say about transcriptome annotation? What software was used?


With the exception of the very first entry (from the OP), for a given transcript, it looks like this to me:

• All GO terms have both an ID and a description (mostly molecular function (MF) terms, though some, such as nucleus, belong to other GO aspects).

e.g. a GO term pair with description and ID: GO:chromatin binding ; GO:0003682

According to http://amigo.geneontology.org/amigo/term/GO:0003682, the description of GO:0003682 is chromatin binding.

• For each GO term, the description and ID are separated by a semicolon with a space before and after it. The format is: GO term description ; GO term ID.

e.g. GO:chromatin binding ; GO:0003682

• Each GO term is separated from the next one by a tight pipe (|, a pipe with no spaces before or after it).

e.g. GO:0005515 and GO:0008270 are separated by | with no spaces on either side: GO:protein binding ; GO:0005515|GO:zinc ion binding ; GO:0008270

• For lines such as the first one (where the GO terms are separated by a tight ; instead of a tight |), one can use a regex replacement to separate the GO terms: turn every ; into |, then turn each spaced " | " back into " ; ".

e.g.

 echo "Ocbimv22017828m GO:chromatin binding ; GO:0003682;GO:protein binding ; GO:0005515" | sed 's/;/|/g;s/ | / ; /g'
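The same substitution can be applied to the whole file and followed by a flattening step, so that each annotation ends up on its own row. This is only a sketch under the assumptions stated in this thread: the transcript ID is the first whitespace-delimited field, pairs are "name ; accession", and no GO name contains " ; " or "|". The function name go_to_table and the file name annotations.go are hypothetical.

```shell
# go_to_table FILE: print one "transcript<TAB>GO accession<TAB>GO name" row per annotation.
go_to_table() {
    # Step 1: the sed from the one-liner, so every pair separator becomes a tight pipe.
    sed 's/;/|/g;s/ | / ; /g' "$1" |
    awk '{
        if (NF < 2) next                     # skip lines that carry no annotation
        id = $1                              # transcript ID is the first field
        rest = $0
        sub(/^[^ \t]+[ \t]+/, "", rest)      # drop the ID and the whitespace after it
        n = split(rest, pairs, "[|]")        # one element per "name ; accession" pair
        for (i = 1; i <= n; i++) {
            split(pairs[i], parts, " ; ")    # parts[1] = name, parts[2] = accession
            name = parts[1]
            sub(/^GO:/, "", name)            # names carry a redundant "GO:" prefix
            print id "\t" parts[2] "\t" name
        }
    }'
}
```

For example, go_to_table annotations.go > annotations.tsv would give a long-format table that is usually easier to load into R or a spreadsheet than the original one-line-per-transcript layout.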