Master dictionary of database file (2D dictionary) using Biopython, then specify key with string
0
0
Entering edit mode
4.1 years ago

My goal is to create a 2d dictionary, search for some sequence ids, and then writes the organism name with the amino acid sequence to a file

I have a working code that creates one dictionary, and looks for the id numbers that I want, and writes it to a file. I am unable to iterate through all my files, but it only returns the id number. I was looking to return the organism name as the key.

I have seen many examples on how to parse a single file as a dictionary to retrieve a dictionary as id:seq. Another example I seen seems to turn the header into a string, then split, but I am unsuccessful. Some of my headers have commas, and some do not. The examples I seen were splitting on ("|")

>EFE00375.1 S-adenosylmethionine-dependent methyltransferase, YraL family [Lactobacillus crispatus 214-1]


My command python master_lacto_dict.py L_214.txt P_1_Results.txt P_1_Clustal.txt

import sys
from Bio import SeqIO

aa_db_file = sys.argv[1]  # Amino Acid Database file ~ 17 files
accession_id_file = sys.argv[2] # Accession IDs  file ~ 18 accession id numbers
file_for_clustal = sys.argv[3] # Output fasta file

wanted = set()
with open(accession_id_file) as f:
for line in f:
line = line.strip()
if line != "":

fasta_database = SeqIO.parse(open(aa_db_file),'fasta')
#fasta_database = Seq.IO.index("file_name", "fasta")  Also seen this in many examples

with open(file_for_clustal, "w") as f:
for seq in fasta_database:
if seq.id in wanted:
SeqIO.write([seq], f, "fasta")

#Desired output
#crispatus 214-1:seq

Python Fasta dictionary • 1.6k views
1
Entering edit mode

Hi, could you please specify your concrete question here (it seems there are a couple of them)?

Besides, as your question is rather of a "programming" nature, I would recommend to search for your problem on Google and especially at Stackoverflow (kind of biostars for programmers). From my experience, there is no question that has not yet been asked there, so if you can specify your concrete problem you should find a solution there.

Best,

Cindy

1
Entering edit mode

Well, if the desired information is always hidden inside these brackets and you can make sure that these are the only occurrences of those brackets, why don't you search for the occurrences of "[" and "]" to retrieve start and end index, and then get the corresponding substring of rec.description?

1
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

0
Entering edit mode

Hello,

I know have a for loop that seems to work

fasta_files = "L_214.txt","L_224-1.txt" # two files that are used as my database(s)
recs = [rec for f in fasta_files for rec in SeqIO.parse(f, 'fasta')]
print recs


Output:

id='EFB61523.1', name='EFB61523.1', description='EFB61523.1 cyclic nucleotide-binding domain protein, partial [Lactobacillus gasseri 224-1]', dbxrefs=[]), SeqRecord(seq=Seq('MLLVKILKNYYQAVFNKNADEVKVISDVLEKAGWKDFVSVLPHLNS', SingleLetterAlphabet())


How can I parse the recs? I want what is inside those brackets?

Lactobacillus gasseri 224-1:SeqRecord

1
Entering edit mode
import re

d = {}
fasta_files = "L_214.txt","L_224-1.txt"
for f in fasta_files:
for rec in SeqIO.parse(f, 'fasta'):
desc = rec.description
if '[' in desc:
organism = re.findall('$([^$]+)', desc)[0]
d[organism] = rec
print(d)

0
Entering edit mode

the re module is very useful in this case. Now I am curious how to iterates through my results files, which is basically a handful of text files, with one column of data.

How could I iterate through these files and check my master dictionary??

EEQ68960.1
EEQ26037.1
EFE00375.1
EFB62114.1
EEQ24341.1
EEJ40899.1
EEW51706.1

0
Entering edit mode

I was thinking about the glob function, but maybe there is an easier method

path = '/Users/sueparks/' + 'P_*' + '_Results.txt'
for file in glob.glob(path):
with open(file, 'r') as f:
print line