Parser to provide references to publications and ID numbers from GenBank
9.8 years ago
Alice ▴ 320

Hello, biostars!

I have a list of GenBank specimen vouchers. Does anyone know of a parser or script that retrieves the accession numbers and publication information for these sequences?

To make a table like:

Vouchers: VL03F635, WE02491
IDs: EF104615, AY557140, AY556740

References: Wiemers, 2003; Wiemers et al., 2010

It isn't difficult to write a Python script for this, but the problem lies in the GenBank records themselves: years and authors appear in different fields, and I don't know how to process such data.

python sequencing • 1.8k views
9.8 years ago
David W 4.9k

Hi Alice,

I was at a meeting last week where we decided to work on exactly this! If you wait a week or two we should have a tool to do this automatically in R.

Until then, you should check out Biopython, and especially its GenBank parser:

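from Bio import Entrez, SeqIO

# NCBI asks for a contact address with every E-utilities request
Entrez.email = "you@example.com"  # placeholder: use your own address

gb_ids = ["EF104615", "AY557140", "AY556740"]

# Fetch the full GenBank flat-file records for these accessions
query = Entrez.efetch(db="nucleotide", id=gb_ids, rettype="gb", retmode="text")

for rec in SeqIO.parse(query, "genbank"):
    print(rec.id)
    # Records without a REFERENCE block have no "references" key
    for paper in rec.annotations.get("references", []):
        # Skip the submission entry, which is not a publication
        if paper.title != "Direct Submission":
            print("{0} ({1})".format(paper.authors, paper.journal))
query.close()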

Check out the Bio.SeqFeature.Reference docs to see what these objects represent.
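
To get short labels like "Wiemers, 2003" out of those fields, something along these lines might work. It is only a rough sketch: short_citation is a made-up helper, not part of Biopython, and it assumes the authors string looks like "Surname,I., Surname,I. and Surname,I." and that the journal string carries the year in parentheses, which is the usual GenBank layout but is not guaranteed for every record. The strings in the usage note are invented examples, not real references.

```python
import re


def short_citation(authors, journal):
    """Build a 'Surname et al., Year' label from GenBank-style strings.

    Hypothetical helper: assumes authors are written "Surname,I." and
    separated by ", " or " and ", and that the year sits in parentheses.
    """
    # Split on ", " and " and "; the comma inside "Surname,I." has no
    # following space, so each name stays in one piece.
    names = [n for n in re.split(r",\s+|\s+and\s+", authors.strip()) if n]
    surnames = [n.split(",")[0] for n in names]

    if not surnames:
        who = "Anonymous"
    elif len(surnames) == 1:
        who = surnames[0]
    elif len(surnames) == 2:
        who = "{0} and {1}".format(surnames[0], surnames[1])
    else:
        who = "{0} et al.".format(surnames[0])

    # Take the last four-digit number that is followed by ")".
    years = re.findall(r"(\d{4})\)", journal)
    year = years[-1] if years else "n.d."
    return "{0}, {1}".format(who, year)
```

Inside the loop above you would call short_citation(paper.authors, paper.journal). With an invented entry such as authors "Smith,J., Jones,A. and Lee,K." and journal "J. Example Biol. 12 (3), 45-67 (2010)", it returns "Smith et al., 2010". Entries with no year in parentheses (for example "Unpublished") come back with "n.d.".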
