I am looking at the yeast reference annotation (in GFF3 format), downloaded from either SGD or Ensembl Fungi. In both cases, the GFF3 file contains odd character sequences in the attributes field (the ninth column), which cause me a world of trouble downstream. Example:
chrXVI SGD gene 174343 174756 . - . ID=YPL197C;Name=YPL197C;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode%20a%20functional%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data%3B%20partially%20overlaps%20the%20ribosomal%20gene%20RPB7B;display=Dubious%20open%20reading%20frame;dbxref=SGD:S000006118;orf_classification=Dubious
See the "%20" and "%3B" characters?
As far as I understand, these are UTF-8 hex representations of certain characters. Why are they included this way, and how can I get rid of them?
To view the full GFF file, download the genome release, extract the tar.gz, and look at the file saccharomyces_cerevisiae_R64-2-1_20150113.gff.
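For what it's worth, if the "%20" and "%3B" sequences are URL-style percent escapes (an assumption on my part), then decoding them might look like the sketch below. It uses Python's standard `urllib.parse.unquote`; the helper name `parse_attributes` and the shortened example string are mine, not part of any GFF library.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Split a GFF3 attributes column into a dict, decoding percent escapes.

    Hypothetical helper for illustration; assumes 'key=value' pairs separated by ';'.
    """
    attrs = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue  # tolerate empty entries such as a trailing ';'
        key, _, value = pair.partition("=")
        # Decode only after splitting, so an escaped ';' (%3B) inside a value
        # is not mistaken for a field separator.
        attrs[unquote(key)] = unquote(value)
    return attrs


# Shortened version of the attributes column from the example line above.
example = (
    "ID=YPL197C;"
    "Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode"
    "%20a%20functional%20protein;"
    "orf_classification=Dubious"
)

print(parse_attributes(example)["Note"])
# Dubious open reading frame; unlikely to encode a functional protein
```

The splitting is done before the decoding on purpose: decoding the whole line first would turn "%3B" into a literal ";" and break the Note value in two. Multi-valued attributes such as Ontology_term are left as a single comma-joined string here.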