Question

Master dictionary of database file (2D dictionary) using Biopython, then specify key with string

0

Entering edit mode

6.4 years ago

incensefrenzie2006 • 0

My goal is to create a 2d dictionary, search for some sequence ids, and then writes the organism name with the amino acid sequence to a file

I have a working code that creates one dictionary, and looks for the id numbers that I want, and writes it to a file. I am unable to iterate through all my files, but it only returns the id number. I was looking to return the organism name as the key.

I have seen many examples on how to parse a single file as a dictionary to retrieve a dictionary as id:seq. Another example I seen seems to turn the header into a string, then split, but I am unsuccessful. Some of my headers have commas, and some do not. The examples I seen were splitting on ("|")

>EFE00375.1 S-adenosylmethionine-dependent methyltransferase, YraL family [Lactobacillus crispatus 214-1]

My command python master_lacto_dict.py L_214.txt P_1_Results.txt P_1_Clustal.txt

import sys
from Bio import SeqIO


aa_db_file = sys.argv[1]  # Amino Acid Database file ~ 17 files
accession_id_file = sys.argv[2] # Accession IDs  file ~ 18 accession id numbers
file_for_clustal = sys.argv[3] # Output fasta file

wanted = set()
with open(accession_id_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)


fasta_database = SeqIO.parse(open(aa_db_file),'fasta')
#fasta_database = Seq.IO.index("file_name", "fasta")  Also seen this in many examples



with open(file_for_clustal, "w") as f:
    for seq in fasta_database:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

#Desired output
#crispatus 214-1:seq

Python Fasta dictionary • 2.1k views

ADD COMMENT • link updated 6.4 years ago by cindy.perscheid ▴ 100 • written 6.4 years ago by incensefrenzie2006 • 0

1

Entering edit mode

Hi, could you please specify your concrete question here (it seems there are a couple of them)?

Besides, as your question is rather of a "programming" nature, I would recommend to search for your problem on Google and especially at Stackoverflow (kind of biostars for programmers). From my experience, there is no question that has not yet been asked there, so if you can specify your concrete problem you should find a solution there.

Best,

Cindy

ADD REPLY • link 6.4 years ago by cindy.perscheid ▴ 100

1

Entering edit mode

Well, if the desired information is always hidden inside these brackets and you can make sure that these are the only occurrences of those brackets, why don't you search for the occurrences of "[" and "]" to retrieve start and end index, and then get the corresponding substring of rec.description?

ADD REPLY • link 6.4 years ago by cindy.perscheid ▴ 100

1

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

Hello,

I know have a for loop that seems to work

fasta_files = "L_214.txt","L_224-1.txt" # two files that are used as my database(s)
recs = [rec for f in fasta_files for rec in SeqIO.parse(f, 'fasta')]
print recs

Output:

id='EFB61523.1', name='EFB61523.1', description='EFB61523.1 cyclic nucleotide-binding domain protein, partial [Lactobacillus gasseri 224-1]', dbxrefs=[]), SeqRecord(seq=Seq('MLLVKILKNYYQAVFNKNADEVKVISDVLEKAGWKDFVSVLPHLNS', SingleLetterAlphabet())

How can I parse the recs? I want what is inside those brackets?

Lactobacillus gasseri 224-1:SeqRecord

ADD REPLY • link 6.4 years ago by incensefrenzie2006 • 0

1

Entering edit mode

import re

d = {}
fasta_files = "L_214.txt","L_224-1.txt"
for f in fasta_files:
    for rec in SeqIO.parse(f, 'fasta'):
        desc = rec.description
        if '[' in desc:
            organism = re.findall('\[([^\]]+)', desc)[0]
            d[organism] = rec
print(d)

ADD REPLY • link 6.4 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

the re module is very useful in this case. Now I am curious how to iterates through my results files, which is basically a handful of text files, with one column of data.

How could I iterate through these files and check my master dictionary??

EEQ68960.1
EEQ26037.1
EFE00375.1
EFB62114.1
EEQ24341.1
EEJ40899.1
EEW51706.1

ADD REPLY • link 6.3 years ago by incensefrenzie2006 • 0

0

Entering edit mode

I was thinking about the glob function, but maybe there is an easier method

path = '/Users/sueparks/' + 'P_*' + '_Results.txt'
for file in glob.glob(path):
    with open(file, 'r') as f:
        line = f.readlines()
        print line

ADD REPLY • link 6.3 years ago by incensefrenzie2006 • 0