Question

Is there a way to skip existing keys in SeqIO.to_dict? Or is there a better way altogether?

3

12.8 years ago

James Estevez ▴ 90

Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. To pull these out, I'd like to build a dictionary that parses each defline, which looks like this:

>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]

and uses the gene name as the key. So I wrote:

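def get_gene(identifier):
    # Split the description into its bracketed fields, e.g. 'gene=thrL'
    seqrline = identifier.description.split(' [')
    seqrline = [x.strip(']') for x in seqrline]
    # lstrip('gene=') strips a set of characters, not a prefix, so it would
    # mangle names starting with g, e or n; split on '=' instead
    gene_name = seqrline[1].split('=', 1)[1]
    return gene_name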

from Bio import SeqIO

handle = open('NC_000913.faat', 'r')
record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)

But there are duplicates, naturally:

>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'insB'

Is there any way to get around this?

EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict by means of a grep chain:

grep ">" NC_000913.faa | grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys

The question about duplicate keys still stands, though.

biopython • 4.8k views

score 7 · Answer 1 · 2011-07-27

SeqIO.to_dict is meant for the standard case where every key is unique, and it raises a ValueError as soon as it sees a duplicate. Here you should parse the file yourself and build up a dictionary that handles duplicate genes explicitly. Assuming you want to collect all of the records for a gene name together as a list:

import collections

from Bio import SeqIO

record_dict = collections.defaultdict(list)
with open('test.fa', 'r') as in_handle:
    for rec in SeqIO.parse(in_handle, "fasta"):
        cur_id = get_gene(rec)
        record_dict[cur_id].append(rec)

# .items() and %-formatting behave the same on Python 2 and Python 3
for key, vals in record_dict.items():
    print("%s %s" % (key, vals))

which for a small test file like:

> test [gene=A] 
GATC
> test2 [gene=B] 
GATC
> test3 [gene=A] 
GATC

will generate:

A [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test', name='test', description=' test [gene=A]', dbxrefs=[]), 
   SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test3', name='test3', description=' test3 [gene=A]', dbxrefs=[])]
B [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test2', name='test2', description=' test2 [gene=B]', dbxrefs=[])]
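
If you would rather skip existing keys, as the question originally asked, a keep-first variant is one option. The following is only a minimal sketch: `gene_key`, `to_dict_keep_first` and `FakeRecord` are hypothetical names introduced for illustration, not part of Biopython, and the regex assumes every defline carries a field of the form `[gene=NAME]`. The stand-in records only mimic the `id` and `description` attributes of a SeqRecord so that the example runs without Biopython installed.

```python
import re
from collections import namedtuple

# Matches the [gene=NAME] field anywhere in a defline (assumed format).
GENE_FIELD = re.compile(r"\[gene=([^\]]+)\]")


def gene_key(record):
    """Return the gene name found in record.description (hypothetical helper)."""
    match = GENE_FIELD.search(record.description)
    if match is None:
        raise ValueError("No [gene=...] field in %r" % record.description)
    return match.group(1)


def to_dict_keep_first(records, key_function):
    """Build a dict keyed by key_function, skipping keys that already exist."""
    record_dict = {}
    for rec in records:
        key = key_function(rec)
        if key not in record_dict:  # first occurrence wins, later ones are ignored
            record_dict[key] = rec
    return record_dict


# Stand-in records so the sketch runs on its own; with Biopython you would
# pass SeqIO.parse(handle, "fasta") as `records` instead.
FakeRecord = namedtuple("FakeRecord", ["id", "description"])
records = [
    FakeRecord("test", "test [gene=A]"),
    FakeRecord("test2", "test2 [gene=B]"),
    FakeRecord("test3", "test3 [gene=A]"),
]

kept = to_dict_keep_first(records, gene_key)
print(sorted(kept))   # ['A', 'B']
print(kept["A"].id)   # test
```

Keeping the first record silently discards the later copies (for example the repeated insB entries), so this only makes sense if the duplicates are interchangeable for your purpose. Otherwise the list-collecting version above is the safer choice.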