Issue reading fasta file with Biopython
19 months ago
Rox ★ 1.4k

Hello everyone,

I know this should be very easy, but I am stuck and cannot pinpoint my mistake.

I wanted a boolean Python function that checks whether a given file is in FASTA format, without having to check the extension (.fa, .fasta, etc.) myself. I found this solution, which suited me. When looking for the files it needs, my Python script now uses this "is_fasta" function.

My problem is that it works for some files and not for others. When it fails, I get an error like one of these when trying to read the FASTA file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 551: invalid continuation byte
#or
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 23: invalid start byte


So I understand there might be something wrong with the encoding of the file. I usually check it with the "file" command, but I get "ASCII text" both for files that work and for files that do not, and when I ask for more information with file -i, it just prints "regular file". So I don't see anything about utf-8 or the like, and my understanding of file formats more or less stops here.
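A check along these lines could show which byte trips the decoder, independently of what the "file" command reports. It is only a minimal sketch: first_invalid_utf8_byte is a hypothetical helper name, it reads the whole file into memory, and it assumes that the decoding in question is plain UTF-8 (which is what the error messages suggest). As a side note, on macOS the MIME type and encoding output is, I believe, under file -I rather than file -i.

```python
def first_invalid_utf8_byte(path):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole file decodes cleanly.

    Hypothetical helper for illustration; reads the entire file at once.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte in `data`
        return err.start, data[err.start]
    return None
```

Calling it on one of the failing files and printing the result with hex() on the byte value should give something comparable to the 0xf3 or 0x87 from the tracebacks.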

I am working in a conda environment that I built with several tools; the Python version inside is 3.6.10. I added Biopython with the regular conda command, from the conda-forge channel.

Does anyone have any advice about this issue? Or should I go back to my original idea of just checking the file extension?

Thank you and have a nice day,

python • 1.1k views

Hmmm, good lead indeed. I forgot to say that I am on a Mac (and totally new to it). It seems my LANG variable is empty... I will see whether playing around with that helps solve the issue.


Ah, sadly this was not the issue. I use a custom set of settings for my shell (zsh), and I set the locale properly by following the steps here: https://github.com/ohmyzsh/ohmyzsh/issues/7558 . But even with my LANG fixed, it still does not work and still shows encoding errors :/
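Since the locale does not seem to be the cause, one fallback worth trying is to treat any file that cannot be decoded as text as "not FASTA" instead of letting the exception propagate. The sketch below is not the linked solution and does not use Biopython: it is a simplified, standard-library stand-in that only looks at the first non-blank line, and it rests on the assumption (unconfirmed here) that the failing files are simply not UTF-8 text, for example binary files picked up while scanning a directory.

```python
def is_fasta(path):
    """Return True if the file opens as UTF-8 text and its first
    non-blank line starts with '>'; return False otherwise.

    Simplified illustration: it does not validate the sequence lines.
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    # The first non-blank line decides the answer
                    return line.startswith(">")
    except UnicodeDecodeError:
        # Undecodable bytes: assume this is not a text FASTA file
        return False
    # Empty file, or only blank lines
    return False
```

Passing encoding="utf-8" explicitly also makes the behaviour independent of LANG and the rest of the locale settings. The same try/except around a Biopython-based check should work in the same way if full parsing is needed.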