Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. In order to pull these out I'd like to make a dictionary that parses a defline that looks like this:
>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
and uses the gene name as the key. So I wrote:
def get_gene(identifier): seqrline = identifier.description.split(' [') seqrline = [x.strip(']') for x in seqrline] gene_name = seqrline.lstrip('gene=') return gene_name handle = open('NC_000913.faat', 'r') record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
But there are duplicates, naturally:
>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict raise ValueError("Duplicate key '%s'" % key) ValueError: Duplicate key 'insB'
Is there any way to get around this?
EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by
SeqIO.to_dict by means of a grep chain:
grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys
The question about duplicate keys still stands, though.