Ensembl encoding problem
3 months ago
giammafer • 0

Hi everybody, I have a problem with two files downloaded from the Ensembl FTP site:

Homo_sapiens.GRCh38.pep.all.fa.gz - http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/pep/

Homo_sapiens.GRCh38.104.chr.gtf.gz - http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/

I need to open them on a Windows system. I tried Word and WordPad, but the encoding does not seem to be recognized. Word offers a list of possible encodings when I try to open the files, but none of them renders the files in a readable form.

I also tried to open them with a Python script, but I always get the same error:

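def file_head(file_name, number_of_lines, encode="utf8"):
    # Print the first number_of_lines lines of a text file.
    with open(file_name, 'r', encoding=encode) as file_hand:
        for i, line in enumerate(file_hand):
            if i >= number_of_lines:
                break
            # Each line already ends with a newline, so do not add another.
            print(line, end='')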

# ------------ MAIN --------------

filename = 'myfasta.fasta'


The error message is always like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte


I think that these Ensembl files are used a lot by researchers, but I did not find any valid solution on the web. I do not know where my mistake is.

3 months ago

Did you use Firefox to download? Firefox, for some reason, double-zips the already zipped files, which means when you unzip them, you just get binary files again. We recommend using another browser or a command-line tool (e.g. wget) to download.
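
As a rough illustration of the double-compression case described above, the following Python sketch strips gzip layers until the data no longer starts with the gzip magic bytes (0x1f 0x8b). The function name peel_gzip_layers and the max_layers limit are hypothetical, and the sketch assumes the file fits in memory and that the real content is text that does not itself begin with those two bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def peel_gzip_layers(data: bytes, max_layers: int = 5) -> tuple[bytes, int]:
    """Decompress repeatedly while the data still looks gzipped.

    Returns the final bytes and the number of gzip layers removed.
    Hypothetical helper; max_layers is an arbitrary safety limit.
    """
    layers = 0
    while data.startswith(GZIP_MAGIC) and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers
```

A result of two layers would be consistent with a file that was compressed twice; a result of one layer is what a normal .gz download should give.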

Yes Emily, you are right.

I used Chrome instead of Firefox, unzipped with WinRAR, and it works. Now I can open the files in Word and with the Python script.

I was not aware of the double zip behaviour of Firefox.

Thank you very much.

Be aware that it does this to other files too, not just ours.

3 months ago

Have you used gunzip (or another decompression tool) to decompress these after having downloaded them? Can you show all commands after you downloaded the original files?
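
One quick way to check whether a file is still compressed is to look at its first two bytes: a gzip stream starts with 0x1f 0x8b, which is also why a UTF-8 decoder stops at byte 0x8b in position 1. The sketch below is only an illustration; looks_gzipped is a hypothetical helper name, not part of any Ensembl tooling.

```python
def looks_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    # Read in binary mode so that no text decoding is attempted.
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"
```

If this returns True for a file that has already been unpacked once, the file was probably compressed more than once.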

I can open one of these files on Windows 10 after decompressing it with 7-Zip.

Kevin

Thank you Kevin for the suggestion.

I was using WinRAR, but I installed 7-Zip and it works.

However, I think that the main problem was the double zip due to the Firefox download.

3 months ago
Mensur Dlakic ★ 14k

These files are gzipped, which is a form of compression. You need a program called gunzip to unpack them - they will lose the .gz extension after unpacking, and become ordinary text files that can be opened in Word or WordPad.
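
If the goal is only to inspect the files from Python, unpacking them first is not strictly necessary: the standard-library gzip module can decompress while reading. The following is a minimal sketch, assuming the file is gzipped exactly once and contains UTF-8 (or plain ASCII) text; head_gz is a hypothetical helper name.

```python
import gzip
from itertools import islice


def head_gz(path: str, n: int = 5, encoding: str = "utf-8") -> list[str]:
    """Return the first n lines of a gzip-compressed text file."""
    # Mode "rt" makes gzip.open decompress and decode to text on the fly.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, head_gz("Homo_sapiens.GRCh38.pep.all.fa.gz", 3) would return the first three lines of the peptide FASTA file, assuming it sits in the current working directory.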