Question: How can I match a file of IDs against a file of FASTA sequences corresponding to those IDs using a Python script?

yasmineo52 wrote (5 months ago):

Hey guys! I have two files: one contains a list of IDs (accession numbers of FASTA sequences) and the other contains the FASTA sequences corresponding to those IDs. I have a Python script that works and matches the two files, but only if the sequences file is short. When I swap in a longer file, it doesn't give me the result that I'm looking for.

This is a sample of what the IDs file looks like:
A0A1B0GWI1
A2ABF9
A2ABF9
A2RUH7
ENSG00000255346
ENSG00000255346
ENSG00000255346
EBI-10021813
EBI-10021813

The FASTA sequences file looks like this:

>tr|A0A1B0GWI1  |A0A1B0GWI1  _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833 
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN   
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK   

>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK

>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN

The script that I'm using is the following:

from Bio import SeqIO

fasta = SeqIO.parse("Sequences.fasta","fasta")

# read all the names
with open("IDs.txt", "r") as f:  # this takes care to close the file afterwards
    names = [line.strip().lstrip('>') for line in f]
print("IDs ", names)

for record in fasta:
    if record.id in names:
        print("Matchs:", record.id, record.seq, record.description)

The output should be like this:

A0A1B0GWI1
>tr|A0A1B0GWI1|A0A1B0GWI1  _PLAFA Kinesin-like protein, putative OS=Plasmodium falciparum OX=5833 
GN=PocGH01_01021600 PE=3 SV=1
MHKRAYSESVALANRTLRRSDEKSFARELKAEDRIDEKQGHKLENVVVRIRKLEKNEESS
LHTDPNDKTTLYFNKDFSIEKYNFDMVFNENDNNEMIFKKIGGHLIVNNVCRGFKETVIT
YGQTGSGKTYTLFGSNKEYGIVYYFVYHLYKLCNFKNKKKTIYLSIYEILGDTLVDLISY
QNEKSIEFYTEEYYLKTIRYPYKVVNIKNYETAKKIIDTASSRSHAIIQFFVNISDSTRS
NGIETVRDYYGVLTLVDLVGCEREEFNTTKKEKSKDDKTSTKILNSSLTSLNKMLRKMQM
GNLDESDKRQSVLCKVLFNYIQKTCGVCLIFCFNPQMSQKSLTSSTLIMASECKKIKSKR
KQLIYVKSENKDAFFKKIANDSGKAGRCHGEYEEGRWDQNTSKQNGKDGKDGTVGTDGER
TREEKDDVSNNTTNASVIVVYANGERGNNVVNLSSEMGESEKYKSLKNIVREIIEEKGRE
EKKKNNLIQELKNDVVKLQKECAFWKRETHNYHNKLKVLNKNYIKMNEYLFNTLNNNSSN   
LCNSSFVKCENHTYGEWKSEHLAKKGNVLVHGGKYEQAKGGWRKDRQSNTEQGGCDIVHT
PHNAKDSDQKQTRSHHSPDLFTSDGTYNADNGIADSLYEKMETDNYFKKKNTTGVYQIDD
EYTLKREKSHNKLVPFDEKKKSEKECLLNNSPERKKYFRKVFTKELINYEQNSAHRENWE
KENDTPEHVREQRNDKKKPIYEKKKNSTIDAYNEHDTVFKKKNCVSNFVEKRENNDLNSH
QIVKNDVTVDIIRNKTNNSNEEPLLRNYQTNEDVDPSPYYNKMDTENVKRKNSEKGIIPT
DELTMNNAEATKMGSTLNIPSKSATMNYAPKSEALHMIPSEHFNHRHNVTVESLATKIKN
RILKSRSLSIAK   

A2ABF9
>tr|A2ABF9_PLAFA Ribosome-recycling factor, putative OS=Plasmodium falciparum OX=5833 GN=RRF1 PE=4 SV=1
MVTISYSYCNAFIINSKCKRATYLFSGDSNVKRDLLTCWKRKYTSSNNRTKGDDYFSLHA
HKKKKSKKADAVEKMLKKKIITIREPNLNANEGNYSGVDSNHLVKKEYDEEHTEIANVPF
EKGKNKNDKNKKGIAKEEKQNSSFTLKNYKIRKNVANELVKTEENLEKSTAQSTAQSTAQ
STAKSTEKNTKKIFINALQNDDDITDTSSEEETTDETTEATEATEATEATEATEATEATE
ATEEDLHNLSSICEEKMNSVYNYIKKESYRFNINNVSSVMFEDEKIKINERIYKIKHICH
IKMKENLFTLTPYDPYFVNFIYIHLKKEYNEINVYIKNNSVYILIPPISENLKNELLIKI
KNKIENSKIILRNIRKNILHKLDILKKKISKDIYFKQKNYIQSLHDKTKKKIEHIFTELK

A2ABF9
>tr|A2ABF9|A2ABF9 _PLAFA Uncharacterized protein OS=Plasmodium falciparum OX=5833 GN=PocGH01_12014600 PE=4 SV=1
MSFKEKLYRRKKCNEVVYEAYGKIDQEVKTSEKGVKGKNETEENLAKKLPFFLNKKKNEN
TCDLLMNVSEHDEDVYFISKTIKTYLNLNEPVRIRLSLLYNTAQYDLFRNYLIAIRDMYR
GIFALKVTRNGNIKKKKILFTTYNINIIGTWTRKIILYDEVTQVYISNSCTPELHIFEKK
FEDYVNRKNYIVIRTLYRDYSFLFLRDDEIVMKIKKKSAMGKIKHLLKKNELRGELKGEL
KNELIFKGHPGSELQDNTISTTGENCEDHLKNTHKGKLAMKDKIQDINATAQGFKFFFLR
ALRNSKTQSVDIDNEDVREMLKKNNMIQFDKYDFKNQKINSLFLFLQIMLDLCGPEIWFT
SKFDEILFTHLN

finswimmer wrote (5 months ago):

Hello,

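You have to iterate over your list of names and check whether each name is part of record.id. Biopython sets record.id to the header text up to the first whitespace (for example tr|A2ABF9_PLAFA), so the exact test "record.id in names" only succeeds when the whole ID is identical to a line in IDs.txt. Something like this: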

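# fasta and names are the objects created in your script above
for record in fasta:
    for n in names:
        # substring test: "A2ABF9" is found inside "tr|A2ABF9_PLAFA"
        if n in record.id:
            print("Matchs:", record.id, record.seq, record.description)
            break  # report each record once, even if several names match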
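
One caveat with a plain substring test: a short ID such as A2ABF9 would also be found inside a longer accession like A2ABF90, and looping over every name for every record gets slow on big files. A possible alternative, sketched below, splits each header into tokens and intersects them with a set of wanted IDs. It is only a sketch under assumptions: it uses a small stdlib reader instead of Biopython so that it is self-contained, the function names (header_tokens, read_fasta, match_records) are made up for illustration, and it assumes your IDs never contain "|", "_" or whitespace. With Biopython you could pass record.description in place of the header.

```python
import re


def header_tokens(header):
    """Split a FASTA header into candidate ID tokens.

    Assumption: IDs are delimited by '|', '_' or whitespace, so
    'tr|A2ABF9_PLAFA Ribosome...' yields a set that includes 'A2ABF9'.
    """
    return set(re.split(r"[|_\s]+", header.lstrip(">")))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def match_records(id_lines, fasta_lines):
    """Return (matched_id, header, sequence) for each whole-token match."""
    # A set removes duplicate IDs and makes each lookup cheap.
    wanted = {line.strip().lstrip(">") for line in id_lines if line.strip()}
    hits = []
    for header, seq in read_fasta(fasta_lines):
        for acc in sorted(wanted & header_tokens(header)):
            hits.append((acc, header, seq))
    return hits
```

Usage would be along the lines of opening both files and calling match_records(open("IDs.txt"), open("Sequences.fasta")), then printing the tuples however you like.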

If there is no need to use your own script, you should have a look at seqkit grep.
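
If you stay with Python and want the layout shown in the question (the ID on its own line, then the FASTA header, then the sequence in 60-character lines), a small formatting helper might look like the sketch below. The name format_hit and the default width of 60 are my own assumptions, taken from the line length in your sample; with Biopython you would pass record.description as the header and str(record.seq) as the sequence.

```python
def format_hit(matched_id, header, sequence, width=60):
    """Render one match as: ID line, '>' header line, wrapped sequence."""
    # Slice the sequence into fixed-width lines, as in the sample file.
    wrapped = "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )
    return f"{matched_id}\n>{header}\n{wrapped}\n"
```

Printing format_hit(...) for every match gives one block per record, with the trailing newline leaving the blank line between blocks when you use print.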

fin swimmer
