Weird characters in yeast reference gff3
3 months ago
liorglic ▴ 430

I am looking at the yeast reference annotation (in gff3 format) downloaded from either SGD or Ensembl fungi. In both cases, the gff3 file appears to contain weird characters in the attributes field, which cause me a world of trouble downstream. Example:

chrXVI  SGD     gene    174343  174756  .       -       .       ID=YPL197C;Name=YPL197C;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode%20a%20functional%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data%3B%20partially%20overlaps%20the%20ribosomal%20gene%20RPB7B;display=Dubious%20open%20reading%20frame;dbxref=SGD:S000006118;orf_classification=Dubious


See the "%20" and "%3B" characters?
As far as I understand, these are UTF-8 hex representations of certain characters, but why are they included this way? And how can I get rid of them?

To view the full gff file, download the genome release, extract the tar.gz, and look at the file saccharomyces_cerevisiae_R64-2-1_20150113.gff

SGD gff3 yeast gff • 412 views
0

%20 represents a space, and you could replace it with _ using sed.

0

Looks like HTML/JavaScript encoding of, for instance, a space (%20), and so on.

You could look up what they encode and 'translate' them back. However, whitespace might also cause problems downstream, so it may be better to translate them to something else (_ for instance).
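
As a minimal sketch of 'translating them back', Python's standard-library function urllib.parse.unquote decodes percent-encoded sequences. The sample string is an excerpt of the Note attribute from the question; the underscore replacement is only the optional convention suggested above, not something the format requires.

```python
from urllib.parse import unquote

# Excerpt of the Note attribute value from the example line in the question.
note = "Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode%20a%20functional%20protein"

# Decode every %XX sequence back to the character it stands for.
decoded = unquote(note)
print(decoded)

# Optional: swap spaces for underscores if downstream tools choke on whitespace.
underscored = decoded.replace(" ", "_")
print(underscored)
```

Note that decoding a whole line this way puts a literal ';' back into the attributes column, so it is safer to decode individual values after the line has been split.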

0

I am wondering what type of analysis you are doing in which those characters cause trouble?

0

I had issues working with the gff using the Python package gffutils, since it decodes the percent-encoding: if I have e.g. "%3B" (the encoding of ';'), it creates an invalid feature record. In any case, I removed all occurrences of "%3B" using sed, and this seems to solve the issue.
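
An alternative to deleting "%3B" is to split the attributes column on its reserved characters first and decode each value only afterwards, so a decoded ';' can never be mistaken for a field separator. The sketch below is independent of gffutils; parse_attributes is a hypothetical helper name, and the input is a shortened version of the example line from the question.

```python
from urllib.parse import unquote

def parse_attributes(column9):
    """Split a GFF3 attributes string into a dict of lists, decoding last."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        # Multiple values are comma-separated; decode each one after splitting.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes

# Shortened attributes column from the example line in the question.
example = (
    "ID=YPL197C;Name=YPL197C;"
    "Ontology_term=GO:0003674,GO:0005575,GO:0008150;"
    "Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode"
    "%20a%20functional%20protein%2C%20based%20on%20available%20experimental"
    "%20and%20comparative%20sequence%20data;"
    "orf_classification=Dubious"
)

parsed = parse_attributes(example)
print(parsed["Note"][0])
print(parsed["Ontology_term"])
```

Because decoding happens per value, the ';' and ',' inside the Note text survive intact while the three GO terms are still split correctly.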

2
3 months ago
Juke34 ★ 6.3k

It is URL-encoded, as required by the specification. The specification may have evolved slightly since this file was created. See the most up-to-date version here. You probably have a line such as ##gff-version 3.1.26 at the top of the file, which tells you which version of the specification was followed.
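
To check which version a given file declares, a short sketch like the following could be used. read_gff_version is a hypothetical helper, and the in-memory demo content is a toy stand-in for a real file, not data from SGD.

```python
import io

def read_gff_version(handle):
    """Return the version string from the '##gff-version' line, or None."""
    for line in handle:
        if line.startswith("##gff-version"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
        if not line.startswith("#"):
            break  # reached the feature lines without finding the pragma
    return None

# Toy content standing in for a real GFF3 file.
demo = io.StringIO("##gff-version 3.1.26\nchrI\tSGD\tgene\t1\t10\t.\t+\t.\tID=x\n")
print(read_gff_version(demo))
```

For a real file, the same function can be passed an open file handle, for example one opened on saccharomyces_cerevisiae_R64-2-1_20150113.gff.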

Description of the Format

GFF3 files are nine-column, tab-delimited, plain text files. Literal use of tab, newline, carriage return, the percent (%) sign, and control characters must be encoded using RFC 3986 Percent-Encoding; no other characters may be encoded. Backslash and other ad-hoc escaping conventions that have been added to the GFF format are not allowed. The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of UTF-8 is recommended.

tab (%09)
newline (%0A)
carriage return (%0D)
% percent (%25)
control characters (%00 through %1F, %7F)

In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:

; semicolon (%3B)
= equals (%3D)
& ampersand (%26)
, comma (%2C)

Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces. Use of the "+" (plus) character to encode spaces is deprecated from early versions of the spec and is no longer allowed.

Undefined fields are replaced with the "." character, as described in the original GFF spec.
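
The escaping rules quoted above can be sketched in Python as follows. gff3_escape is a hypothetical helper written only from the quoted text, not an official implementation; it leaves spaces untouched, since the quoted text allows unescaped spaces within fields.

```python
import re
from urllib.parse import unquote

# Always encoded: percent sign and control characters (includes tab, LF, CR).
_ALWAYS_RE = re.compile(r"[%\x00-\x1f\x7f]")
# Column 9 additionally reserves ; = & ,
_COLUMN9_RE = re.compile(r"[%\x00-\x1f\x7f;=&,]")

def gff3_escape(text, column9=False):
    """Percent-encode the characters listed in the quoted GFF3 rules."""
    pattern = _COLUMN9_RE if column9 else _ALWAYS_RE
    return pattern.sub(lambda m: "%{:02X}".format(ord(m.group())), text)

value = "Dubious open reading frame; unlikely, 100% sure"
escaped = gff3_escape(value, column9=True)
print(escaped)

# unquote reverses the encoding, so the round trip returns the original text.
print(unquote(escaped) == value)
```

Escaping before writing and decoding only after splitting keeps the tab, ';', '=' and ',' delimiters unambiguous in both directions.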