Question: Parser to provide references to publications and ID numbers from GenBank
4.7 years ago, Alice wrote:

Hello, biostars!

I have a list of GenBank specimen vouchers. Does anyone know of a parser or script that retrieves the accession numbers and publication information for these sequences?

I'd like to build a table like:

Vouchers: VL03F635, WE02491

IDs: EF104615, AY557140, AY556740

References: Wiemers, 2003; Wiemers et al., 2010


Writing a Python script for this isn't difficult in itself; the problem is the GenBank records. Authors and years are stored in separate fields, and I don't know how to process that kind of data.



4.7 years ago, David W (New Zealand) wrote:

Hi Alice, 

I was at a meeting last week where we decided to work on exactly this! If you wait a week or two, we should have a tool that does this automatically in R.

Until then, you should check out Biopython, especially its GenBank parser:

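from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with E-utilities requests (placeholder, use your own).
Entrez.email = "your.name@example.org"

gb_ids = ["EF104615", "AY557140", "AY556740"]

# Fetch all records in one request, as GenBank flat files.
query = Entrez.efetch(db="nucleotide", id=gb_ids, rettype="gb", retmode="text")

for rec in SeqIO.parse(query, "genbank"):
    # Each entry is a Bio.SeqFeature.Reference object; some records have none.
    for paper in rec.annotations.get("references", []):
        # Skip the submission entry and keep only actual publications.
        if paper.title != "Direct Submission":
            print("{0} ({1})".format(paper.authors, paper.journal))

query.close()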

Check out the Bio.SeqFeature.Reference docs to see which attributes these objects carry (authors, title, journal and so on).
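
On the point about authors and years living in different fields: in a parsed record the authors usually come as one string (paper.authors, e.g. "Surname,I., Surname,I. and Surname,I."), and the year typically sits in parentheses inside paper.journal. Here is a rough sketch, assuming those layouts hold for your records, that turns the two strings into a short "Author, Year" label. The helper names (surnames, pub_year, short_citation) are made up for illustration, and the sample strings are invented, not real GenBank entries.

```python
import re


def surnames(authors):
    """Split a GenBank-style author string into a list of surnames.

    Assumes the usual flat-file layout, where authors are separated by
    ", " or " and " and initials follow the surname after a bare comma.
    """
    parts = re.split(r",\s+|\s+and\s+", authors.strip())
    return [p.split(",")[0].strip() for p in parts if p.strip()]


def pub_year(journal):
    """Return the last four-digit number in parentheses, or None."""
    years = re.findall(r"\((\d{4})\)", journal)
    return years[-1] if years else None


def short_citation(authors, journal):
    """Build a label such as 'Smith et al., 2010' from the two fields."""
    names = surnames(authors)
    year = pub_year(journal) or "n.d."  # e.g. journal is just "Unpublished"
    if not names:
        lead = "Anonymous"
    elif len(names) == 1:
        lead = names[0]
    elif len(names) == 2:
        lead = "{0} and {1}".format(names[0], names[1])
    else:
        lead = "{0} et al.".format(names[0])
    return "{0}, {1}".format(lead, year)


# Invented example strings in the GenBank style:
print(short_citation("Smith,J.A., Jones,B. and Lee,C.",
                     "J. Example 33 (1), 139-167 (2010)"))
# -> Smith et al., 2010
```

Inside the parsing loop you would call short_citation(paper.authors, paper.journal) and collect the results in a set per voucher to avoid duplicates. Consortium authors and unusual journal strings will need extra handling, so spot-check the output against a few records.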



