Question: Fixing a biopython script
1
gravatar for tpaisie
2.4 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

So I'm trying to actively learn how to use Biopython, and python for that matter. I made a script that extracts the Genbank accession number and the country from a .gb (genbank) file and puts the accession number and country in a text file. Here is my script:

from Bio import SeqIO
gb_file = "denv1.gb"

for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank") :
     printgb_record.name),
     for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                                if qualifiers == 'country':
                                        country = source.qualifiers['country']
                                        print(country[0])

Now the output text file, every accession number with a country is listed perfectly, but when a country is not present for a particular sequence in the genbank file, it does this:

example of .gb file

Instead I would like it to be blank where the name of the country would go. Hopefully this makes sense and isn't too confusing. Can anyone help me out? As I said I'm trying to learn, so pointers or links to help would be greatly appreciated!

genbank biopython python • 972 views
ADD COMMENTlink modified 2.4 years ago by jrj.healey12k • written 2.4 years ago by tpaisie70
1

Couple of other things to point out. with SeqIO.parse() you don't also need to call open(). My suggestion would be to throw in a check for existence of location... have a look at this thread from the other day: C: How do can I use Biopython and SeqIO to parse out multiple genes from several NC

from Bio import SeqIO
import sys

with open(sys.argv[2], 'w') as nfh:
        for rec in SeqIO.parse(sys.argv[1], "genbank"):
                if rec.features:
                        for feature in rec.features:
                                if feature.type == "CDS":
                                        nfh.write(">%s from %s\n%s\n" % (
                                        feature.qualifiers['gene'][0],
                                        rec.name,
                                        feature.location.extract(rec).seq))

On line 5, you're making sure that feature exists before you do something with it. My guess is that your code is failing to detect that a source isnt found in the next record, and populating it with the last thing source evaluated to from a previous record.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by jrj.healey12k
2

ok so i added this to my script:

from Bio import SeqIO
gb_file = "denv1.gb"

# parse genbank file
for gb_record in SeqIO.parse(gb_file, "genbank") :
    # now do something with the record
        printgb_record.name),   # print genbank accession number
        for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                            if qualifiers == 'country':
                                country = source.qualifiers['country']
                                print(country[0]) # prints the country
                        else:
                            print("")

and it does leave a the accession numbers country column blank but it also adds an extra line inbetween each accession number and country:

script text file

ADD REPLYlink written 2.4 years ago by tpaisie70

I suspect this will be to do with carry over of a newline character in the case where the source can actually be found from the file, but not if it's blank.

ADD REPLYlink written 2.4 years ago by jrj.healey12k

So I need to add something to my code to tell it when no country is present in the file it should print nothing? Are you showing that in the above code?

ADD REPLYlink written 2.4 years ago by tpaisie70

I haven't tested your code but that would be my guess. I could be wrong :p

The code above is similar to what you're doing. It extracts info from Genbanks.

I believe you need a statement something like:

source = gb_record.features[0]
if not source:
    source = 'no source information'
ADD REPLYlink written 2.4 years ago by jrj.healey12k

Can you upload your genbank that you're trying this on so I can replicate what you've got?

ADD REPLYlink written 2.4 years ago by jrj.healey12k

So the entire file is to large, but there is a shortened version of the file: http://gobin.io/OBcO

ADD REPLYlink written 2.4 years ago by tpaisie70

Could you format the code using in the biostar post editor ?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by sacha1.7k

There! Thanks for letting me know!

ADD REPLYlink written 2.4 years ago by tpaisie70
0
gravatar for jrj.healey
2.4 years ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

Ok here's what I've ended up with:

from Bio import SeqIO
gb_file = "test.gbk"

# parse genbank file
for gb_record in SeqIO.parse(gb_file, "genbank") :
    # now do something with the record
        for feat in gb_record.features:
                if feat.type == 'source':
                        source = gb_record.features[0]
                        for qualifiers in source.qualifiers:
                            if qualifiers == 'country':
                                country = str(source.qualifiers['country'][0])
                            elif country == '':
                                country = 'none'

        printgb_record.name + ' ' + country)

This works on the test data. The file as you provided it returns:

KT279761 Haiti
EU482591 USA: Puerto Rico
HQ332181 Venezuela
FJ810415 Venezuela: Aragua

If I manually change one of the gbk entries so that it doesn't contain any country data:

            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="viral cRNA"
                     /strain="Haiti/1207/2014"
                     /serotype="1"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country=""
                     /collection_date="27-Nov-2014"
     5'UTR           1..94

The same code will return:

KT279761 none
EU482591 USA: Puerto Rico
HQ332181 Venezuela
FJ810415 Venezuela: Aragua

However, without seeing one of your genbanks which doesnt contain the country information I'm not sure if this will work 100% of the time. If the /country= field is missing altogether I think it may still break. For your own practice you may want to alter the code to try and handle that (another simple elif statement would probably do it).

ADD COMMENTlink written 2.4 years ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1930 users visited in the last hour