Hi everyone! I have two files: one contains a list of IDs (accession numbers of FASTA sequences) and the other contains the FASTA sequences corresponding to those IDs. I have a Python script that matches the two files, but it only works when the sequences file is short. When I swap in a longer file, it doesn't give me the result I'm looking for.
This is a sample of what the IDs file looks like:
A0A1B0GWI1
A2ABF9
A2ABF9
A2RUH7
ENSG00000255346
ENSG00000255346
ENSG00000255346
EBI-10021813
EBI-10021813
The FASTA sequences file looks like this:
>tr|A0A1B0GWI1 |A0A1B0GWI1 _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK
>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK
>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN
The script that I'm using is the following:
from Bio import SeqIO

fasta = SeqIO.parse("Sequences.fasta", "fasta")

# read all the names
with open("IDs.txt", "r") as f:  # "with" closes the file afterwards
    names = [line.strip().lstrip('>') for line in f]
print("IDs ", names)

# print every record whose ID appears in the list
for record in fasta:
    if record.id in names:
        print("Match:", record.id, record.seq, record.description)
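One thing that may explain the missing matches: SeqIO sets record.id to the first whitespace-delimited word of the header, so for the sample headers above it would be something like "tr|A0A1B0GWI1" or "tr|A2ABF9_PLAFA", never the bare accession from the IDs file. The sketch below illustrates that with plain strings (no Biopython needed). The helper name accessions_in_id is hypothetical, and the assumption that accessions sit between "|" characters, optionally followed by a "_SUFFIX", comes only from the sample headers shown here.

```python
def accessions_in_id(record_id):
    """Return every token of a pipe-delimited FASTA ID that could be an accession.

    Each token is kept as-is and also with any "_SUFFIX" removed, so
    "tr|A2ABF9_PLAFA" yields {"tr", "A2ABF9_PLAFA", "A2ABF9"}.
    """
    candidates = set()
    for token in record_id.split("|"):
        if not token:
            continue
        candidates.add(token)
        prefix = token.split("_")[0]
        if prefix:
            candidates.add(prefix)
    return candidates


# IDs as SeqIO would report them for the three sample headers
# (the first word after ">")
sample_ids = ["tr|A0A1B0GWI1", "tr|A2ABF9_PLAFA", "tr|A2ABF9|A2ABF9"]

# a set gives fast lookups and ignores duplicate lines in the IDs file
wanted = {"A0A1B0GWI1", "A2ABF9"}

for record_id in sample_ids:
    plain_match = record_id in wanted                          # test used in the script
    token_match = bool(accessions_in_id(record_id) & wanted)   # token-based test
    print(record_id, plain_match, token_match)
```

For all three sample IDs the plain membership test is False while the token-based test is True. In the script, the condition could then become something like "if accessions_in_id(record.id) & set(names):", again assuming the headers follow the pattern above.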
The output should be like this:
A0A1B0GWI1
>tr|A0A1B0GWI1|A0A1B0GWI1 _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK
A2ABF9
>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK
A2ABF9
>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN
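The layout above (the ID on its own line, followed by the untouched FASTA record) could be produced along these lines. This is only a sketch: it swaps SeqIO for a small hand-written reader so that header and sequence lines are echoed exactly as they appear in the file, and read_fasta and matching_blocks are hypothetical names, not part of any library. It also assumes the accession sits between "|" characters in the first word of the header, optionally followed by a "_SUFFIX". The demo records are shortened to their first sequence line.

```python
import io


def read_fasta(handle):
    """Yield (header_line, body_lines) for each record, keeping lines verbatim."""
    header, body = None, []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None and line.strip():
            body.append(line)
    if header is not None:
        yield header, body


def matching_blocks(ids, fasta_handle):
    """Yield 'ID + record' text blocks for records whose header carries a wanted ID."""
    # a set makes each lookup cheap even for long ID lists and drops duplicates
    wanted = {i.strip().lstrip(">") for i in ids if i.strip()}
    for header, body in read_fasta(fasta_handle):
        words = header[1:].split()
        first_word = words[0] if words else ""
        for token in first_word.split("|"):
            # try the token itself, then the token without a "_SUFFIX" part
            hit = next((c for c in (token, token.split("_")[0]) if c in wanted), None)
            if hit is not None:
                yield "\n".join([hit, header, *body])
                break


# demo input: two headers from the sample file, first sequence line only
demo_fasta = io.StringIO(
    ">tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1\n"
    "MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA\n"
    ">tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1\n"
    "MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN\n"
)

for block in matching_blocks(["A2ABF9", "A2ABF9"], demo_fasta):
    print(block)
```

With the real files, the two open file objects for IDs.txt and Sequences.fasta could be passed straight to matching_blocks, since it strips each ID line itself and reads the FASTA file one line at a time.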