Ensembl encoding problem
2.8 years ago
giammafer ▴ 20

Hi everybody, I have a problem with two files downloaded from the Ensembl FTP site:

Homo_sapiens.GRCh38.pep.all.fa.gz - http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/pep/

Homo_sapiens.GRCh38.104.chr.gtf.gz - http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/

I need to open them on a Windows system. I tried Word and WordPad, but the encoding is not recognized: Word offers a list of possible encodings when I try to open the files, but none of them renders the files in a readable form.


I also tried to open them with a Python script, but I always get the same error:

def file_head(file_name, number_of_lines, encode="utf8"):
    # Print the first `number_of_lines` lines of a plain-text file.
    with open(file_name, 'r', encoding=encode) as file_hand:
        for i, line in enumerate(file_hand):
            if i >= number_of_lines:
                break
            print(line, end='')

# ------------ MAIN --------------

filename = 'myfasta.fasta'
file_head(filename, 50)

The error message is always like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I think these Ensembl files are used a lot by researchers, but I did not find any working solution on the web. I do not know what I am doing wrong.

Thank you in advance for your help.

Ensembl encoding • 1.4k views
2.8 years ago
Emily 23k

Did you use Firefox to download? Firefox, for some reason, double-zips the already zipped files, which means that when you unzip them once, you just get binary files again. We recommend using another browser or a command-line tool (e.g. wget) to download.
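
For reference, the byte 0x8b at position 1 in the error above is the second byte of the gzip magic number (1f 8b), so the file being read is still gzip-compressed. Below is a minimal sketch, not an official Ensembl tool, that keeps decompressing for as long as the data still starts with that magic number. The function name `peel_gzip_layers` and the `max_layers` limit are made up for illustration, and the sketch works on bytes held in memory, so a streaming approach would be a better fit for the larger GTF file.

```python
import gzip
from typing import Tuple

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def peel_gzip_layers(data: bytes, max_layers: int = 5) -> Tuple[bytes, int]:
    """Decompress repeatedly while the data still looks gzipped.

    Returns the final bytes and the number of layers removed.
    `max_layers` is an arbitrary safety limit chosen for this sketch.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


# Self-contained demo with a made-up FASTA record compressed twice.
sample = b">hypothetical_protein\nMKT\n"
double_zipped = gzip.compress(gzip.compress(sample))
plain, n_layers = peel_gzip_layers(double_zipped)
print(n_layers, plain == sample)  # prints: 2 True
```

If the function reports two layers for a downloaded file, the download was compressed twice and a single unzip is not enough.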


Yes Emily you are right.

I used Chrome instead of Firefox, unzipped with WinRAR, and it works. Now I can open them in Word and with the Python script.


I was not aware of the double zip behaviour of Firefox.

Thank you very much.


It does this to other files too, not just ours, so be aware.

2.8 years ago

Have you used gunzip (or another decompression tool) to decompress these after having downloaded them? Can you show all commands after you downloaded the original files?

I can open one of these files on Windows 10 after decompressing it with 7-Zip.

Kevin


Thank you Kevin for the suggestion.

I was using WinRAR, but I installed 7-Zip and it works.

However, I think that the main problem was the double zip due to the Firefox download.

2.8 years ago
Mensur Dlakic ★ 27k

These files are gzipped, which is a form of compression. You need a program called gunzip to unpack them - they will lose the .gz extension after unpacking, and become ordinary text files that can be opened in Word or WordPad.
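
If the goal is only to look at the first lines from Python, unpacking to disk is optional: the standard-library `gzip` module can open a `.gz` file in text mode. The following is a rough sketch along the lines of the `file_head` function in the question; the name `gz_head` is invented here, and it assumes the file is compressed only once and contains UTF-8 (or plain ASCII) text.

```python
import gzip
from itertools import islice


def gz_head(file_name, number_of_lines, encoding="utf-8"):
    """Return the first `number_of_lines` lines of a gzip-compressed text file."""
    # Mode "rt" makes gzip.open decompress and decode to str on the fly.
    with gzip.open(file_name, "rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, number_of_lines)]


# Example call, assuming the file sits in the working directory:
# for line in gz_head("Homo_sapiens.GRCh38.pep.all.fa.gz", 50):
#     print(line)
```

Because `islice` stops after the requested number of lines, only the start of the file is decompressed.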
