Question: Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether?
3
gravatar for James Estevez
8.0 years ago by
Tacoma, WA
James Estevez90 wrote:

Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. In order to pull these out I'd like to make a dictionary that parses a defline that looks like this:

>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]

and uses the gene name as the key. So I wrote:

def get_gene(identifier):
    seqrline = identifier.description.split(' [')
    seqrline = [x.strip(']') for x in seqrline]
    gene_name = seqrline[1].lstrip('gene=')
    return gene_name

handle = open('NC_000913.faat', 'r')
record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)

But there are duplicates, naturally:

>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'insB'

Is there any way to get around this?

EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict by means of a grep chain:

grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys

The question about duplicate keys still stands, though.

biopython • 2.9k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 8.0 years ago by James Estevez90

Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys The question about duplicate keys still stands, though.

ADD REPLYlink written 8.0 years ago by James Estevez90
7
gravatar for Brad Chapman
8.0 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

SeqIO.to_dict is meant to handle some sets of standard cases, but here you should parse the file and build up a dictionary correctly handling duplicate genes. Assuming you want to collect all of the records for a gene name together as a list:

import collections

from Bio import SeqIO

record_dict = collections.defaultdict(list)
with open('test.fa', 'r') as in_handle:
    for rec in SeqIO.parse(in_handle, "fasta"):
        cur_id = get_gene(rec)
        record_dict[cur_id].append(rec)

for key, vals in record_dict.iteritems():
    print key, vals

which for a small test file like:

> test [gene=A] 
GATC
> test2 [gene=B] 
GATC
> test3 [gene=A] 
GATC

will generate:

A [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test', name='test', description=' test [gene=A]', dbxrefs=[]), 
   SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test3', name='test3', description=' test3 [gene=A]', dbxrefs=[])]
B [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test2', name='test2', description=' test2 [gene=B]', dbxrefs=[])]
ADD COMMENTlink written 8.0 years ago by Brad Chapman9.4k

How would the schematics of the code change if James had multiple files? For example, NC_000913.faa thru NC_000923.faa.

ADD REPLYlink written 14 months ago by incensefrenzie20060
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1070 users visited in the last hour