Question: Biopython drops GenBank LOCUS data?
0
gravatar for jamesfowkes
3.5 years ago by
jamesfowkes0 wrote:

Hello everyone,

I'm helping to write a Python application for storage and simple processing of RNA/DNA/protein sequences. I come from a software background and have no formal biology training at all (beyond high school). I just have a question about how/why Biopython pull information from GenBank.

I'm using Biopython to get a SeqRecord object for a GenBank ID using this function:

def get_seqrecord(gid, db="nucleotide"):
    try:
        handle = Entrez.efetch(db=db, id=gid, rettype="gb", retmode="text")
        return SeqIO.read(handle,"genbank")
    except:
        return None

This is fine, but the user also wants to store the "molecule type" for the sequence, from the LOCUS line, e.g.

LOCUS       JX978171               39269 bp    DNA     linear   INV 14-MAR-2013

would store "DNA"

LOCUS       AY994149                2957 bp    mRNA    linear   INV 01-JUN-2005

would store mRNA.

From what I can tell, Biopython parses that line and extracts the molecule type and uses it to set the sequence alphabet type. In the process, the precise molecule type is lost.

My questions are:

- Is there a reason it does this? Should I not be relying on the molecule type from the LOCUS line?
- Short of manually parsing the LOCUS line (which I would just do by copy-pasting the relevant Biopython code into my own function), is there a way I can get this information for a particular GenBank ID?

Thanks in advance for any help. Sorry if this is a really stupid question, I'm still learning huge swathes of this stuff.

 

sequence • 1.3k views
ADD COMMENTlink modified 4 months ago by Mathis DKZ10 • written 3.5 years ago by jamesfowkes0
1
gravatar for Mathis DKZ
4 months ago by
Mathis DKZ10
Germany
Mathis DKZ10 wrote:

Even if this is an older post, some people might still look for this. Meanwhile it's possible to get the molculetype from the Locus:

for seq_record in SeqIO.parse(handle,"genbank"):  

        print(seq_record.annotations['molecule_type'])

All Locus Informations are name, sequence length, molecule type, molecule topology, Genbank division and date. This is how you get each out of a seq_record:

name = seq_record.name
seq_length = len(seq_record.seq)   
molecule_type = seq_record.annotations['molecule_type']
molecule_topology = seq_record.annotations['topology']
division = seq_record.annotations['data_file_divison']
date =  seq_record.annotations['date']

I hope this can help someone.

ADD COMMENTlink written 4 months ago by Mathis DKZ10
0
gravatar for Peter
3.5 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

This is a known limitation, see https://github.com/biopython/biopython/issues/363

ADD COMMENTlink written 3.5 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1092 users visited in the last hour