How can I match a file of IDs against a file of FASTA sequences corresponding to those IDs using a Python script?
5.3 years ago
yasmineo52 • 0

Hey guys! I have 2 files: one contains a list of IDs (accession numbers of FASTA sequences) and the other contains the FASTA sequences corresponding to those IDs. I have a Python script that matches the 2 files, but it only works when the sequences file is short. When I switch to a longer file, it doesn't give me the result I'm looking for.

This is a sample of what the IDs file looks like:
A0A1B0GWI1
A2ABF9
A2ABF9
A2RUH7
ENSG00000255346
ENSG00000255346
ENSG00000255346
EBI-10021813
EBI-10021813

While the Fasta sequences file looks like this:

>tr|A0A1B0GWI1  |A0A1B0GWI1  _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833 
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN   
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK   

>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK

>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN

The script that i'm using is the following:

from Bio import SeqIO

fasta = SeqIO.parse("Sequences.fasta","fasta")

# read all the names
with open("IDs.txt", "r") as f:  # this takes care to close the file afterwards
    names = [line.strip().lstrip('>') for line in f]
print("IDs ", names)

for record in fasta:
    if record.id in names:
        print("Match:", record.id, record.seq, record.description)

The output should be like this:

A0A1B0GWI1
>tr|A0A1B0GWI1|A0A1B0GWI1  _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833 
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN   
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK   

A2ABF9
>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK

A2ABF9
>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN
5.3 years ago

Hello,

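Biopython sets record.id to the first whitespace-delimited word of the header line (e.g. tr|A2ABF9_PLAFA), so it is never equal to a bare accession such as A2ABF9 and the test "record.id in names" fails. You have to iterate over your names list and check whether each name is part of record.id. Something like this: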

for record in fasta:
    for n in names:
        if n in record.id:
            print("Match:", record.id, record.seq, record.description)
            break
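
One caveat with the substring test above: "A2ABF9" would also match a header containing "A2ABF90", and the nested loop gets slow with many IDs. A minimal sketch of a stricter variant follows, assuming the accession is always a whole field once the header ID is split on "|" and "_" (true for the sample headers, but it would break for IDs that themselves contain an underscore). The helper names load_ids, matched_id and main are made up for this illustration; they are not part of Biopython.

```python
import re


def load_ids(lines):
    """Return the set of non-empty IDs from an iterable of lines (duplicates collapse)."""
    return {line.strip().lstrip(">") for line in lines if line.strip()}


def matched_id(header_id, wanted):
    """Return the first wanted ID that equals a whole field of header_id, else None."""
    # Assumption: fields are separated by "|" or "_",
    # e.g. "tr|A2ABF9_PLAFA" -> ["tr", "A2ABF9", "PLAFA"]
    for field in re.split(r"[|_]", header_id):
        if field in wanted:
            return field
    return None


def main(ids_path="IDs.txt", fasta_path="Sequences.fasta"):
    """Print each matching ID followed by its FASTA record."""
    from Bio import SeqIO  # Biopython, as in the original script

    with open(ids_path) as handle:
        wanted = load_ids(handle)

    for record in SeqIO.parse(fasta_path, "fasta"):
        hit = matched_id(record.id, wanted)
        if hit is not None:
            print(hit)
            print(record.format("fasta"))  # header plus wrapped sequence
```

Calling main() with the two file names from the question prints the ID on one line and the full record below it, close to the layout requested in the question. Because the IDs are kept in a set, each lookup takes constant time regardless of how long the ID list is.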

If there is no need to use your own script, you should have a look at seqkit grep.

fin swimmer
