My goal is to create a 2D dictionary, search it for some sequence ids, and then write the organism name with the amino acid sequence to a file.
I have working code that creates one dictionary, looks for the id numbers that I want, and writes the matching records to a file. However, I am unable to iterate through all my files, and the output only contains the id number. I was looking to return the organism name as the key.
I have seen many examples of how to parse a single file into a dictionary of id:seq. Another example I have seen seems to turn the header into a string and then split it, but I have been unsuccessful with that approach. Some of my headers have commas, and some do not. The examples I have seen were splitting on
>EFE00375.1 S-adenosylmethionine-dependent methyltransferase, YraL family [Lactobacillus crispatus 214-1]
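One possible way to get the organism name out of a header like the one above, without splitting on commas, is to take whatever sits inside the trailing square brackets. This is only a sketch: it assumes the organism is always the last bracketed field of the header, and the names `ORGANISM_RE`, `organism_from_header`, and `drop_genus` are made up for illustration.

```python
import re

# Matches the last [...] group at the very end of a FASTA header line.
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")

def organism_from_header(header, drop_genus=False):
    """Return the text inside the trailing [...] of a header, or None."""
    match = ORGANISM_RE.search(header)
    if match is None:
        return None  # header has no bracketed organism field
    organism = match.group(1).strip()
    if drop_genus:
        # "Lactobacillus crispatus 214-1" -> "crispatus 214-1"
        parts = organism.split(None, 1)
        if len(parts) == 2:
            organism = parts[1]
    return organism

header = (">EFE00375.1 S-adenosylmethionine-dependent methyltransferase, "
          "YraL family [Lactobacillus crispatus 214-1]")
print(organism_from_header(header))
print(organism_from_header(header, drop_genus=True))
```

Because the pattern only looks at the brackets, commas elsewhere in the description do not matter.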
My command:

    python master_lacto_dict.py L_214.txt P_1_Results.txt P_1_Clustal.txt
    import sys
    from Bio import SeqIO

    aa_db_file = sys.argv[1]         # Amino Acid Database file ~ 17 files
    accession_id_file = sys.argv[2]  # Accession IDs file ~ 18 accession id numbers
    file_for_clustal = sys.argv[3]   # Output fasta file

    # Collect the accession ids to look for, skipping blank lines
    wanted = set()
    with open(accession_id_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)

    fasta_database = SeqIO.parse(aa_db_file, "fasta")
    #fasta_database = SeqIO.index("file_name", "fasta")  Also seen this in many examples

    # Write every record whose id is in the wanted set
    with open(file_for_clustal, "w") as f:
        for seq in fasta_database:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

    #Desired output
    #crispatus 214-1:seq
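The 2D dictionary described above could look roughly like `nested[organism][seq_id] = sequence`. The sketch below uses a small hand-written FASTA reader as a stand-in for `SeqIO.parse`, so that it runs with the standard library only. The helper names (`read_fasta`, `add_to_nested`, `wanted_records`), the `"unknown"` fallback key, the demo sequences, and the second demo record are all assumptions for illustration, not real data.

```python
import re
from collections import defaultdict

ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")  # trailing [organism] field

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def add_to_nested(nested, lines):
    """Add the records of one FASTA source as nested[organism][seq_id] = seq."""
    for header, seq in read_fasta(lines):
        fields = header.split(None, 1)
        if not fields:
            continue  # skip records with an empty header
        match = ORGANISM_RE.search(header)
        organism = match.group(1) if match else "unknown"
        nested[organism][fields[0]] = seq
    return nested

def wanted_records(nested, wanted):
    """Yield (organism, seq_id, sequence) for every id in the wanted set."""
    for organism, records in nested.items():
        for seq_id, seq in records.items():
            if seq_id in wanted:
                yield organism, seq_id, seq

# Placeholder input: the sequences and the second record are made up.
demo_lines = [
    ">EFE00375.1 S-adenosylmethionine-dependent methyltransferase, "
    "YraL family [Lactobacillus crispatus 214-1]",
    "MAAA",
    "CCCC",
    ">HYPO0001.1 hypothetical protein [Lactobacillus example 1]",
    "MGGG",
]

nested = add_to_nested(defaultdict(dict), demo_lines)
hits = list(wanted_records(nested, {"EFE00375.1"}))
for organism, seq_id, seq in hits:
    # FASTA-style output with the organism name as the header
    print(f">{organism}\n{seq}")
```

To cover all the database files, `add_to_nested` could be called once per file with an open file handle, reusing the same `nested` dictionary. With Biopython, `record.id` and `record.description` from `SeqIO.parse` provide the id and the full header text, so the same bracket lookup can be applied to `record.description`.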