Question: Retrieve qualifiers plasmid and pathovar
17 months ago, felipelira3 (France/Angers/IRHS) wrote:

I am trying to extract some information from a .gbk (GenBank) file. For the most common qualifiers I can manage. The problem appears when I try to extract "plasmid" and "pathovar" from records where they are not present. Both files used for the tests are at https://github.com/felipelira/files_to_test.

Simply printing 'features.qualifiers' for each file gives the following:

The qualifiers from example 1 show that not all sequences in the file come from the chromosome, because there are two plasmids. The problem is that even when I ask for "plasmid", I cannot retrieve this information in the same way that I obtain "country", "organism", and so on.

Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_A'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_B'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}

Qualifiers from example 2:

{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}

As you can see, file 2 does not have the same set of qualifiers, and the same script, which reads the qualifiers through SeqIO, does not work on it.
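To make the difference concrete, here is a minimal sketch that compares the qualifier keys of the first record of each file. It uses plain dictionaries copied from the output above, so it runs without Biopython; the variable names (example_1, example_2, missing_in_2, missing_in_1) are only illustrative.

```python
# Qualifiers of the first record of each file, copied from the printed output.
example_1 = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'],
    'collection_date': ['1984'], 'country': ['Japan'],
    'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'],
    'host': ['Actinidia'],
    'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853'],
}
example_2 = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'],
    'collection_date': ['2010'], 'country': ['New Zealand'],
    'isolation_source': ['cherry'], 'strain': ['ICMP 3690'],
    'organism': ['Pseudomonas syringae'],
}

# Keys present in one record but absent from the other.
missing_in_2 = sorted(set(example_1) - set(example_2))
missing_in_1 = sorted(set(example_2) - set(example_1))

print(missing_in_2)  # ['host', 'pathovar']
print(missing_in_1)  # ['isolation_source']
```

Any key listed in missing_in_2 raises a KeyError when it is looked up with square brackets on a record from file 2.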

...**Example 1**
FEATURES             Location/Qualifiers
     source          1..6439609
                     /organism="Pseudomonas syringae pv. actinidiae ICMP 9853"
                     /mol_type="genomic DNA"
                     /strain="ICMP 9853"
                     /host="Actinidia"
                     /db_xref="taxon:1104678"
                     /country="Japan"
                     /collection_date="1984"
                     /pathovar="actinidiae"

Trying with the file from example 1 (Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk), the output of the script is:

... **Example 2**
FEATURES             Location/Qualifiers
     source          1..267979
                     /organism="Pseudomonas syringae"
                     /mol_type="genomic DNA"
                     /strain="ICMP 3690"
                     /isolation_source="cherry"
                     /db_xref="taxon:317"
                     /country="New Zealand"
                     /collection_date="2010"

The files that I used for this script are on GitHub, and the script is:

import sys
from Bio import SeqIO
from Bio import GenBank

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            source = seq_feature.qualifiers['organism'][0].replace(' ','_')
            strain = seq_feature.qualifiers['strain'][0]
            country = seq_feature.qualifiers['country'][0]
            # print(seq_feature.qualifiers)
            host = seq_feature.qualifiers['host'][0]
            print(host)

I only included 'host' here because the problem is the same for 'plasmid' and 'pathovar'.

Can anybody help me? Any suggestions using these same Python modules? I am looking for the most Pythonic way to solve this.

Thank you in advance.

17 months ago, mobiusklein (United States) wrote:

If you would like your script to run to completion even when that information is not present, you can wrap each access to seq_feature.qualifiers in a try-except block catching KeyError and IndexError:

import sys
from Bio import SeqIO
from Bio import GenBank

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            try:
                source = seq_feature.qualifiers['organism'][0].replace(' ','_')
            except (KeyError, IndexError):
                source = None
            try:
                strain = seq_feature.qualifiers['strain'][0]
            except (KeyError, IndexError):
                strain = None
            try:
                country = seq_feature.qualifiers['country'][0]
            except (KeyError, IndexError):
                country = None
            # print(seq_feature.qualifiers)
            try:
                host = seq_feature.qualifiers['host'][0]
            except (KeyError, IndexError):
                host = None
            print(host)

It's then up to you to make sure that whatever you do with these values handles the case where they are None rather than strings.
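If the repeated try-except blocks feel heavy, one alternative is a small helper built on dict.get. This is only a sketch: first_qualifier is a hypothetical name, not part of Biopython, and it assumes that seq_feature.qualifiers behaves like an ordinary dictionary mapping each key to a list of values, as the printed output in the question suggests. The sample dictionaries are trimmed copies of that output, so the snippet runs without any input files.

```python
def first_qualifier(qualifiers, key, default=None):
    """Return the first value stored under `key`, or `default` if the key
    is missing or its list of values is empty."""
    values = qualifiers.get(key) or [default]
    return values[0]


# Trimmed copies of the qualifiers printed in the question (assumed shape).
with_plasmid = {
    'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853'],
    'strain': ['ICMP 9853'],
    'host': ['Actinidia'],
    'pathovar': ['actinidiae'],
    'plasmid': ['p9853_A'],
}
without_plasmid = {
    'organism': ['Pseudomonas syringae'],
    'strain': ['ICMP 3690'],
    'isolation_source': ['cherry'],
}

wanted = ('strain', 'host', 'pathovar', 'plasmid')
for qualifiers in (with_plasmid, without_plasmid):
    # Missing qualifiers come back as None instead of raising KeyError.
    row = {key: first_qualifier(qualifiers, key) for key in wanted}
    print(row)
```

Inside the original loop, the call would look like first_qualifier(seq_feature.qualifiers, 'plasmid'). Passing a third argument, for example first_qualifier(seq_feature.qualifiers, 'host', 'unknown'), substitutes a placeholder string for None.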
