Question: getting error i don't understand using biopython for retrieving sequences from fast file
0
gravatar for andynkili
3.3 years ago by
andynkili10
andynkili10 wrote:

I have a fasta file containning PapillomaViruses sequences (entire genomes, partial CDS, ....) and i'm using biopython to retrieve entire genomes (around 7kb) from this files, so here's my code:

rec_dict = SeqIO.index("hpv_id_name_all.fasta","fasta")

for k in rec_dict.keys():

    c=c+1

    if len(rec_dict[k].seq)>7000:

            handle=open(rec_dict[k].description+"_"+str(len(rec_dict[k].seq))+".fasta","w")

            handle.write(">"+rec_dict[k].description+"\n"+str(rec_dict[k].seq)+"\n")

            handle.close()

i'm using a dictionary for avoiding loading everything in memory. The variable "c" is used to know how many iterations are made before THIS error pops up:

Traceback (most recent call last):

  File "<stdin>", line 4, in <module>

IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'

 

when i print the value of "c", i get 9013 while the file contains 10447 sequences, meaning the for loop didn't go through all the sequences (the count is done before the "if" condition, so the i count all the iterations, not only those which match the condition). i don't understand the INPUT/OUTPUT error, it should create the 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'  file instead of verifying its existence, shouldn't it?

 

biopython fasta • 1.3k views
ADD COMMENTlink modified 3.3 years ago by Matt Shirley8.9k • written 3.3 years ago by andynkili10
1

 

You shouldn't use the SeqIO index dictionary like this - every time you access a record with rec_dict[k] then it gets parsed from disk again, which is comparatively slow. Also, by accessing the records from the file in essentially random order (the key order in the dictionary) you are again going to slow down the disk IO. The solution is much simpler - just iterate of the file in one parse.

for record in SeqIO.parse("hpv_id_name_all.fasta","fasta"):
    if len(record) > 700:
        # do stuff with record

Also you are not using Biopython for the FASTA output, which is sometimes appropriate.

ADD REPLYlink written 3.3 years ago by Peter5.8k

+1 for FASTA line wrapping.

ADD REPLYlink written 3.3 years ago by Matt Shirley8.9k

Are you sure? because doing what you've proposed is exactly what i was trying to avoid by using the dictionary. i thought iterate through the file was slower than indexing it and accessing only the record i needed by their key.

ADD REPLYlink written 3.3 years ago by andynkili10
2

In order to get the length of each sequence, you have to look at the entire record for that sequence. So yes, a single parse though the file is the simplest and fastest solution here.

ADD REPLYlink written 3.3 years ago by Peter5.8k

May i add that i've checked for the sequence from which the file should be created (in my code each sequence is written in a fasta file named after the sequence's id in the original fasta file(hpv_id_name_all.fasta, see code)). And this sequence ('EU410347.1|Human papillomavirus FA75/KI88-03_7401') has no particularity that could explain the IOError  

ADD REPLYlink written 3.3 years ago by andynkili10

This is a duplicate of a question on another site http://stackoverflow.com/questions/34015094/ioerror-while-retrieving-sequences-from-fasta-file-using-biopython - please don't do this (unless you include the link to the other question) as you are wasting volunteers' time helping you if someone else has already answered the question elsewhere.

ADD REPLYlink written 3.3 years ago by Peter5.8k
3
gravatar for dschika
3.3 years ago by
dschika290
European Union
dschika290 wrote:

The '/' in 'EU410347.1|Human papillomavirus FA75/KI88-03_7401' is the problem, as python assumes there should be a folder EU410347.1|Human papillomavirus FA75/ where to write the file KI88-03_7401.fasta. Replace the '/' with e.g. '_'. Btw: I would also replace the '|' symbol in filenames.

ADD COMMENTlink written 3.3 years ago by dschika290

Thank you very much, i changed every "/" in the file and it worked well. 

ADD REPLYlink written 3.3 years ago by andynkili10
1
gravatar for Matt Shirley
3.3 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

You can also split up a fasta file like this using pyfaidx:

$ pip install pyfaidx
$ faidx hpv_id_name_all.fasta --split-files --size-range 7000,999999999

Or if you want to write your own script I would suggest using the Fasta class from the module. 

ADD COMMENTlink written 3.3 years ago by Matt Shirley8.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2333 users visited in the last hour