Question: Master dictionary of database file (2D dictionary) using Biopython, then specify key with string
gravatar for incensefrenzie2006
2.5 years ago by
incensefrenzie20060 wrote:

My goal is to create a 2d dictionary, search for some sequence ids, and then writes the organism name with the amino acid sequence to a file

I have a working code that creates one dictionary, and looks for the id numbers that I want, and writes it to a file. I am unable to iterate through all my files, but it only returns the id number. I was looking to return the organism name as the key.

I have seen many examples on how to parse a single file as a dictionary to retrieve a dictionary as id:seq. Another example I seen seems to turn the header into a string, then split, but I am unsuccessful. Some of my headers have commas, and some do not. The examples I seen were splitting on ("|")

>EFE00375.1 S-adenosylmethionine-dependent methyltransferase, YraL family [Lactobacillus crispatus 214-1]

My command python L_214.txt P_1_Results.txt P_1_Clustal.txt

import sys
from Bio import SeqIO

aa_db_file = sys.argv[1]  # Amino Acid Database file ~ 17 files
accession_id_file = sys.argv[2] # Accession IDs  file ~ 18 accession id numbers
file_for_clustal = sys.argv[3] # Output fasta file

wanted = set()
with open(accession_id_file) as f:
    for line in f:
        line = line.strip()
        if line != "":

fasta_database = SeqIO.parse(open(aa_db_file),'fasta')
#fasta_database = Seq.IO.index("file_name", "fasta")  Also seen this in many examples

with open(file_for_clustal, "w") as f:
    for seq in fasta_database:
        if in wanted:
            SeqIO.write([seq], f, "fasta")

#Desired output
#crispatus 214-1:seq
dictionary python fasta • 1.2k views
ADD COMMENTlink modified 2.5 years ago by cindy.perscheid90 • written 2.5 years ago by incensefrenzie20060

Hi, could you please specify your concrete question here (it seems there are a couple of them)?

Besides, as your question is rather of a "programming" nature, I would recommend to search for your problem on Google and especially at Stackoverflow (kind of biostars for programmers). From my experience, there is no question that has not yet been asked there, so if you can specify your concrete problem you should find a solution there.



ADD REPLYlink written 2.5 years ago by cindy.perscheid90

Well, if the desired information is always hidden inside these brackets and you can make sure that these are the only occurrences of those brackets, why don't you search for the occurrences of "[" and "]" to retrieve start and end index, and then get the corresponding substring of rec.description?

ADD REPLYlink written 2.5 years ago by cindy.perscheid90

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 2.5 years ago by genomax83k


I know have a for loop that seems to work

fasta_files = "L_214.txt","L_224-1.txt" # two files that are used as my database(s)
recs = [rec for f in fasta_files for rec in SeqIO.parse(f, 'fasta')]
print recs


id='EFB61523.1', name='EFB61523.1', description='EFB61523.1 cyclic nucleotide-binding domain protein, partial [Lactobacillus gasseri 224-1]', dbxrefs=[]), SeqRecord(seq=Seq('MLLVKILKNYYQAVFNKNADEVKVISDVLEKAGWKDFVSVLPHLNS', SingleLetterAlphabet())

How can I parse the recs? I want what is inside those brackets?

Lactobacillus gasseri 224-1:SeqRecord
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by incensefrenzie20060
import re

d = {}
fasta_files = "L_214.txt","L_224-1.txt"
for f in fasta_files:
    for rec in SeqIO.parse(f, 'fasta'):
        desc = rec.description
        if '[' in desc:
            organism = re.findall('\[([^\]]+)', desc)[0]
            d[organism] = rec
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by a.zielezinski9.1k

the re module is very useful in this case. Now I am curious how to iterates through my results files, which is basically a handful of text files, with one column of data.

How could I iterate through these files and check my master dictionary??

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by incensefrenzie20060

I was thinking about the glob function, but maybe there is an easier method

path = '/Users/sueparks/' + 'P_*' + '_Results.txt'
for file in glob.glob(path):
    with open(file, 'r') as f:
        line = f.readlines()
        print line
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by incensefrenzie20060
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1064 users visited in the last hour