Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. In order to pull these out I'd like to make a dictionary that parses a defline that looks like this:
>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
and uses the gene name as the key. So I wrote:
def get_gene(identifier):
    seqrline = identifier.description.split(' [')
    seqrline = [x.strip(']') for x in seqrline]
    gene_name = seqrline[1].lstrip('gene=')
    return gene_name
handle = open('NC_000913.faat', 'r')
record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
But there are duplicates, naturally:
>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'insB'
Is there any way to get around this?
EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict by means of a grep chain:
grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys
The question about duplicate keys still stands, though.
Ok, so there was an obvious way to get to the same place using the default keys generated by
SeqIO.to_dictgrep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys The question about duplicate keys still stands, though.