Question

getting error I don't understand using biopython for retrieving sequences from fast file

0

Entering edit mode

8.4 years ago

andynkili ▴ 10

I have a fasta file containning PapillomaViruses sequences (entire genomes, partial CDS, ....) and I'm using biopython to retrieve entire genomes (around 7kb) from this files, so here's my code:

rec_dict = SeqIO.index("hpv_id_name_all.fasta","fasta")
for k in rec_dict.keys():
    c=c+1
    if len(rec_dict[k].seq)>7000:
        handle=open(rec_dict[k].description+"_"+str(len(rec_dict[k].seq))+".fasta","w")
        handle.write(">"+rec_dict[k].description+"\n"+str(rec_dict[k].seq)+"\n")
        handle.close()

I'm using a dictionary for avoiding loading everything in memory. The variable "c" is used to know how many iterations are made before THIS error pops up:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'

When I print the value of "c", I get 9013 while the file contains 10447 sequences, meaning the for loop didn't go through all the sequences (the count is done before the "if" condition, so the I count all the iterations, not only those which match the condition). I don't understand the INPUT/OUTPUT error, it should create the 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' file instead of verifying its existence, shouldn't it?

biopython fasta • 3.2k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by andynkili ▴ 10

1

Entering edit mode

You shouldn't use the SeqIO index dictionary like this - every time you access a record with rec_dict[k] then it gets parsed from disk again, which is comparatively slow. Also, by accessing the records from the file in essentially random order (the key order in the dictionary) you are again going to slow down the disk IO. The solution is much simpler - just iterate of the file in one parse.

for record in SeqIO.parse("hpv_id_name_all.fasta", "fasta"):
    if len(record) > 700:
        # do stuff with record

Also you are not using Biopython for the FASTA output, which is sometimes appropriate.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Peter 6.0k

0

Entering edit mode

+1 for FASTA line wrapping.

ADD REPLY • link 8.4 years ago by Matt Shirley 10k

0

Entering edit mode

Are you sure? because doing what you've proposed is exactly what i was trying to avoid by using the dictionary. i thought iterate through the file was slower than indexing it and accessing only the record i needed by their key.

ADD REPLY • link 8.4 years ago by andynkili ▴ 10

2

Entering edit mode

In order to get the length of each sequence, you have to look at the entire record for that sequence. So yes, a single parse though the file is the simplest and fastest solution here.

ADD REPLY • link 8.4 years ago by Peter 6.0k

0

Entering edit mode

May I add that I've checked for the sequence from which the file should be created (in my code each sequence is written in a fasta file named after the sequence's id in the original fasta file(hpv_id_name_all.fasta, see code)). And this sequence ('EU410347.1|Human papillomavirus FA75/KI88-03_7401') has no particularity that could explain the IOError

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by andynkili ▴ 10

0

Entering edit mode

This is a duplicate of a question on another site http://stackoverflow.com/questions/34015094/ioerror-while-retrieving-sequences-from-fasta-file-using-biopython - please don't do this (unless you include the link to the other question) as you are wasting volunteers' time helping you if someone else has already answered the question elsewhere.

ADD REPLY • link 8.4 years ago by Peter 6.0k

1

Entering edit mode

8.4 years ago

Matt Shirley 10k

You can also split up a fasta file like this using pyfaidx:

$ pip install pyfaidx
$ faidx hpv_id_name_all.fasta --split-files --size-range 7000,999999999

Or if you want to write your own script I would suggest using the Fasta class from the module.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Matt Shirley 10k

Ram · Accepted Answer · 2015-12-01

3

Entering edit mode

8.4 years ago

dschika ▴ 320

The '/' in 'EU410347.1|Human papillomavirus FA75/KI88-03_7401' is the problem, as python assumes there should be a folder EU410347.1|Human papillomavirus FA75/ where to write the file KI88-03_7401.fasta. Replace the '/' with e.g. '_'. Btw: I would also replace the '|' symbol in filenames.

ADD COMMENT • link 8.4 years ago by dschika ▴ 320

0

Entering edit mode

Thank you very much, I changed every / in the file and it worked well.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by andynkili ▴ 10