I have two files. The first is a .txt file with three columns, of which I only use columns 2 and 3. The second is a .fasta file that contains the sequences. Using Python, I want to create one output file per column 3 value (using that value as the file name), match the IDs in column 2 of file 1 against the IDs in file 2, and then use Biopython to write the matching sequences to the file named after column 3.
file_1.txt:
009L_FRG3G  **Q6GZW6    3.6.4.-**
019R_FRG3G  **Q6GZV6    2.7.11.1**
044L_IIV3           Q197B6           2.7.11.-
055L_FRG3G  **Q6GZS1    3.6.4.-**
080R_IIV3       Q196Y0  3.6.1.-
088R_FRG3G   Q6GZN7         1.8.3.2
095L_IIV3   Q196W5           3.4.24.- ...
file_2.fasta
>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp|    Q197B6|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
>sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus 3 
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR
IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL
>sp|**Q6GZV6**|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
...
My expected output is multiple files, each containing the sequences that share the same EC:
3.6.4.-.fasta
>sp|**Q6GZW6**|009L_FRG3G Putative helicase 009L OS=Frog virus 3
MDTSPYDFLKLYPWLSRGEADKGTLLDAFPGETFEQSLASDVAMRRAVQDDPAFGHQKLV
ETFLSEDTPYRELLLFHAPGTGKTCTVVSVAERAKEKGLTRGCIVLARGAALLRNFLHEL
>sp|**Q6GZS1**|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3 
MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKITLPMGPQNAGDVTNKAYVDQA
VMSAAVPVASSTTVGTIQMAGDLEGSSGTNPIIAANKITLNKLQKIGPKMVIGNPNSDWN
2.7.11.1.fasta
>sp|Q6GZV6|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 
MEEVDGCAGPNSEAGALTAGALTAGAFAVTAGAGVAGAGVAGVGWCSWCSWCSWCWCSWC
SWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCWCSWCLSKGWEDRGGLEGCKSCKGWCL
ETC...
My problem is slightly more complex than this solution: https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file which only writes its output to a single file.
My code so far:
`#!/usr/bin/python3
import os
from Bio import SeqIO

def get_accession(record):
    """Given a SeqRecord, return the accession number as a string."""
    parts = record.id.split("|")
    assert len(parts) == 3 and parts[0] == "sp"
    return parts[1]

records_dict = SeqIO.to_dict(SeqIO.parse("file_2.fasta", "fasta"), key_function=get_accession)

# initialize a dictionary
answer = {}
with open('file_1.txt', 'r') as content:
    # extracts the accession (column 2) and EC (column 3) from file_1.txt and makes it a dictionary
    for line in content:
        lines = line.split()
        answer[lines[1]] = lines[2]

# does the comparison and the writing to the files here
records = SeqIO.parse("file_2.fasta", "fasta")
for seq_record in records:
    for key in records_dict:       # satisfies the condition that all keys in file1 are in file2
        if key in answer:
            EC = answer[key]
            # converts the EC into a file name
            filename = "".join(ch for ch in EC if ch.isalnum() or ch in ".-").rstrip() + ".fasta"
            mode = 'a' if os.path.exists(filename) else 'w'
            if filename:
                with open(filename, mode) as fileinput:
                    fileinput.write(seq_record.format("fasta").strip())
                    fileinput.write(str(seq_record.seq) + "\n")`
Problem: my script creates the multiple files, but it copies every sequence from file_2 into each of them. Thanks. I am new to Python.
You want to write the sequences whose IDs match column 2 from your fasta into a new fasta named after column 3? Your file2 fasta is also missing '>' for headers. If you can better format your question, we can help.
Please edit your code with the 101010 button.
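One likely cause of the behaviour described in the question is the inner `for key in records_dict` loop: it ignores which record is currently being read, so every record gets written once for every matching key. A minimal sketch of the alternative is to look up each record's own accession exactly once. The sketch below deliberately uses only the standard library so that it runs without Biopython; the function names (`read_ec_map`, `iter_fasta`, `split_by_ec`) are hypothetical, and it assumes the table is whitespace-separated with no `**` markup and that headers look like `>sp|ACCESSION|NAME ...`.

```python
import os
import re


def read_ec_map(table_path):
    """Map accession (column 2) to EC (column 3) from a whitespace-separated table."""
    ec_by_accession = {}
    with open(table_path) as table:
        for line in table:
            fields = line.split()
            if len(fields) >= 3:  # skip blank or incomplete lines
                ec_by_accession[fields[1]] = fields[2]
    return ec_by_accession


def iter_fasta(fasta_path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None and line:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_by_ec(table_path, fasta_path, out_dir="."):
    """Write each record to <EC>.fasta, chosen by the record's own accession.

    Returns a dict of output file name -> number of records written.
    """
    ec_by_accession = read_ec_map(table_path)
    written = {}
    for header, lines in iter_fasta(fasta_path):
        parts = header[1:].split("|")
        if len(parts) < 3:
            continue  # not a "db|accession|name" style header
        ec = ec_by_accession.get(parts[1].strip())
        if ec is None:
            continue  # accession is not listed in the table
        # Keep only characters that are safe in a file name.
        name = re.sub(r"[^A-Za-z0-9.\-]", "", ec) + ".fasta"
        # Truncate on first use in this run, append afterwards.
        mode = "a" if name in written else "w"
        with open(os.path.join(out_dir, name), mode) as out:
            out.write("\n".join([header] + lines) + "\n")
        written[name] = written.get(name, 0) + 1
    return written
```

Calling `split_by_ec("file_1.txt", "file_2.fasta")` would then create one file per EC in the current directory. With Biopython the same single-lookup loop should carry over: iterate with `SeqIO.parse("file_2.fasta", "fasta")`, look up `get_accession(seq_record)` in `answer`, and write the record with `SeqIO.write(seq_record, handle, "fasta")` instead of writing the formatted record and the sequence separately.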