Sorting and writing multifasta entries to new fasta files
2.6 years ago
lachiemck • 0

Hi, first post here. I'm trying to extract the CDS from the orthologous sequences of various species. I'm running on a Linux server, and am mainly aiming to use BioPython or Linux programs for this.

I've run OrthoFinder on 28 species of seaweed, which produced roughly 10,000 orthogroup sequence fasta files, each of which is a multi-fasta file. I've concatenated them into one huge multifasta file, and now I want to extract the sequences by species, writing each species to its own new multifasta file (so 10k files -> 1 file -> 28 files, one per species).

How do I do this? I'm still fairly new to BioPython, so I'm still wrapping my head around things. I know I'll definitely need SeqIO, not sure what other libraries I'll need. I already have a text file with all the species listed, one per line.

Thanks heaps for any help. Lachlan

BioPython OrthoFinder fasta • 3.0k views

Thank you for explaining the issue in detail. It would help if you could post an example input file and the expected output file.


Hi, thanks for the reply. What I want to do with the example below is sort the fasta records by species name, then write all the sequences for each species to that species' own fasta file. Note that there will be multiple sequences of various lengths for each species (I'm talking tens of thousands of sequences all up in the end).

Thanks.

Here's part of an example file. (Edit: I just noticed Biostars doesn't like copy-pasted fasta files, and I don't know how to fix that.)

>scaffold-BWVJ-2021190-Betaphycus_philippinensis
MESIIALEGDGYVLIAADVTSARSVVVMKTDLDKIRALDNHKLFAAAGVPGDVTKFTEHVQKDVRLYNMRSGITMSTAAVANYTRGELAKFLRKAPYQCNVLIGGFDHAPTGDGASLYSCDYLGTLHKLKFAAEGYAQYFVLSTLDRYWKKNLTLGEGVEVIGKCVAEIQKRLVINQPKFCIKVVDKDGVR
>scaffold-CKXF-2021055-Ahnfeltiopsis_flabelliformis
MESIIALQGDGYILMAADVTSARSVVVMKHDMDKIRPLDETKLFAAAGVPGDVSKFSEHVQKDVRLYNMRSGITMSTAAVANYTRGELARFLRKSPYQCNILIGGYDPQPNGDGPSLYSCDYLGTLHKLTFAAEGYAQYFVLSTLDRYWKKNLSLDEGLAVIRKCIEEVQKRLVINQPRFSIKVVSKDG
>scaffold-IEHF-2016702-Dumontia_simplex
MESLLALEGDGYVLIAADVANARSVVVMKDDMDKIRPLDATKLFAGAGAPGDVSKFTEHVQKDVRLYTLRSGITMSTAAVAHYTRGELAKALRKAPYQCNVLIAGYDAPPNDEGPSLYSCDYLGTLHKLSFAAEGYAQYFTLSTMDRYWKKNLSLDEGLAIIRKCIAEVQKRLVINQPRFCIKVVSKDG
>scaffold-IHJY-2053086-Kappaphycus_alvarezii
MESIIALEGDDYVLIAADVTSARSVVVMKTDLDKIRALDNHKLFAAAGVPGDVTKFTEHVQKDVRLYNMRSGITMSTAAVVNYTRGELAKFLRKAPYQCNVLIGGYDHAPTGNGASLYSCDYLGTLHKLKFAAEGYAQYFVLSTLDRYWKKNLTLDE
>scaffold-IKIZ-2013483-Grateloupia_livida
MESIIALEGDGYILIAADVASARSIVVMKDDMDKIRPLDSHKLFAAAGIPGDVSKFTEHVQRDVRLYNMRSGITMSTAAVANYTRGELARFLRRSPFQCNVLIGGYDAAPFGNGPSLYSCDYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSVEEGVEVIKKCVAEVQKRLVINQPRFAIKVVDKNGVRAID
>scaffold-IKWM-2003998-Gracilaria_lemaneiformis
MESVIALEGDGFVIIAADVSNARSIVVMKDDVDKIRVLDDHKLFAAVGDPGDVSKFFEHIQKDVKLYNMRSGITISTAAMANYTRGELARFLRRSPFQCNVVMAGYDPAPNGAGPSLYTCDYLGTLAKLKFAAEGYAQYFVLSTLDRYWKKNMSVEDGLAVIKKCIAEVQTRFVVSQRRFAIKVVSKDGVK
>scaffold-JEBK-2023344-Eucheuma_denticulatum
MESIIALEGDGYVLIAADVASARSVVVMKTDLDKIRALDDYKLFAAAGVPGDVTKFTEHVQKDVRLYNMRSGITMSTAAVANYTRGELAKFLRKSPYQCNVLIGGYDHAPTGDGASLYSCDYLGTLHKLKFAAEGYAQYFVLSTLDRYWKKNLSLDDGVEVIGKCIAEIQKRLVINQPKFCIKVVDKDGVR
>scaffold-JJZR-2008250-Rhodochaete_parvula
MDSIISLVGGDFVLSAADTGHAQSVVVMKQDMDKIMALDEHKILSIAGEWGDAVQFTEYVQKNVHLYELRTGITMSTPAVANYTRNWLAKSIRSNPYNVNLLLGGWDKTTGPSLYFLDYLGTCHPMKYSAQGYASFFVQSTLDRHWREGMSLDEALDVMRKCIAEVSMRFVINQPSFTAKVVDKDGVR
>scaffold-LJPN-2000184-Gracilaria_blodgettii
MESVIALEGDGYVMIAADVSSARSIVVMKDDVDKIRALDHQKLFAAVGSPGDVSKFCEHIQKDVQLYNMRSGITMSTAAMANYTRGELARFLRRSPFQCNILMAGYDAPPNGNAASLYSCDYLGTLVKLKFAAEGYAQYFVLSTLDRYWKKNLSVDDGLQIIKKCIAEIQKRLVISQPRFSIKMVSKDG
>scaffold-LJPN-2016751-Gracilaria_blodgettii
DQSATRSILVYKDDEDKMVQLDDFKVAAGNGPLSDRAEFFEYLQKNMKLYQLRNGITLKGHAAANFMRGEMATMLRSNPKSVNVLLGTVDKEEGGVAPALYWMDYISSLAKVNYGAHGYGAHFCLGIFDRYWKPDLTQEEAVKILRLCRNELDERFL
>scaffold-LLXJ-2039348-Chroodactylon_ornatum
ETILGVVGKDFVMVLADKSAARSILAFKHDEDKISKLDEHKVVAACGETADRTAFTEYVQRNMALDEFRTGLRRTTDATAHFIRGELATALRKSPYVSLLLAGFDDISGGKAGAEKTEAEAVGKESKGEASATEASTSGVGPSLYWMDYLGTMQRVNYGAHGYAAFFSTSTMDRYWKPGMTEEEAADLLATCVAQLKTRFIIHQPNFTVRVVSASGVKD
>scaffold-OBUY-2017628-Porphyridium_cruentum
GDGFVMCAADMTNARSIVVMKEDMDKIMELDRHRLLCMAGEPGDVAQFTEYVQKNVHLYQLRTGVSQSTRAIANFTRNELAKSLRKNPYSVNLLLGGYDQHDGPEVFYLDYLGTLHKMPFSAQGYCAYFILATLDRYYKPNMSEQEALEVMRKAIDEVRIRFLIKQPDFLIKVVDKNGIRTVS
>scaffold-PVGP-2018385-Porphyridium_purpureum
NVHLYQLRTGVSQSTRAIANFTRNELAKSLRKNPYSVNLLLGGYDQHDGPEVFYLDYLGTLHKMPFSAQGYCAYFILATLDRYYKPNMSEQEALEVMRKAIDEVRIRFLIKQPDFLIKVVDKNGIRTVS
>scaffold-PWKQ-2004203-Gracilaria_sp.
MESVIALEGDGYVMIAADVSNARSIVVMKDDVDKIRALDHQKLFAAVGTPGDVSKFCEHIQKDVRLYNMRSGITMSTAAMANYTRGELARFLRRSPFQCNILIGGYDAPPNGTGASLYSCDYLGTLVKLKFAAEGYAQYFVLSTLDRYWKKNLSVDEGLEIIKKCVAEIQKRLVISQPRFAIKMVSKDG
>scaffold-PYDB-2022840-Grateloupia_catenata
MECSLAMTFGDFALVASDATNARSILVMKEDYDKCFRLSDSLLMSATGEAGDTAQFAEYIAKSLQLYRMRNSYELSPKAAATFTRRNLADYLRSRTPYMVNLIIVGFDKEQSTCEMYYMDYLASMVKVPYGAHGYGGFFTTALMDRHYRPDMNREEAYQLMKDCVQEIHKRLIVNLPTFKVQLVDKDGIKD
>scaffold-PYDB-2024127-Grateloupia_catenata
MESIIALEGDGYVMMAADVASARSVVVMKDDMDKIWPLDSHKLFAAAGIPGDVSKFTEHVQRDVRLYNMRSGITMSTAAVANYTRGELARFLRRSPFQCNVLIGGYDAPSYGACPSLYSCDYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSVEEGVDVLKKCIAEVQKRLVINQPRFAIKVVDKDGVR
>scaffold-RSOF-2002228-Glaucosphaera_vacuolata
MDSLISLVGDEFVLSAADTNNARSILVMKDDLDKIMHLDDHKLLSVAGEQGDAVYFTEYIQKNSHLYALRTGIPLTTDALANYTRGELAKFLRKSPYAVNLLLAGYDAATGPALYYLDYLGTLLKTTYTAQGYASYFVLATMDRYWKKGMNEADAVELMRKCIAEVKQRFLINQPSLFM
>scaffold-SBLT-2000922-Gloiopeltis_furcata
MESIIALQGADYVLIAADVSSARSVVVMKDDMDKIRVLDSHKLFAAAGVPGDVSKFSEHIQKDVRLYNMRSGITMSTAAVANYTRGELARFLRKSPYQCNVLIGGYDEGNAEGKGPSLYSCDYLGTLHRLSFAAEGYAQYFVLSTLDRYWQKGMGVEQGVEVVKKCIREVQKRLVINQPRFVIKVVGKDG
>scaffold-UGPM-2023043-Chondrus_crispus
MESIIALEGRDYVLIAADVSSARSVVVMKDDMDKIRALDSHKLFAAAGPPGDVCKFSEHVQKDVRLYNMRSGITMSTAAVANYTRGELARLLRKAPYQCNVLIGGFDAAPHGTGPALYSCDYLGTLHRLRFAAEGYAQYFVLSTMDRYWRKGMAVDEGVDVVRRCIAEVQKRLVINQPRFVIKVVDKDGVR
>scaffold-URSB-2000329-Grateloupia_turuturu
MESIIALEGDGYVLIAADVASARSVVVMKDDMDKIRPLDSHKLFAAAGPFTEHVQRDVRLYNMRSGITMSTAAVANYTRGELARFLRRSPFQCNVLIGGYDAPPFGQGASLYSCDYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSVEEGVEVIKKCVAEVQKRLVINQPRFAIKVVDKDGVR
>scaffold-URSB-2000330-Grateloupia_turuturu
MESIIALEGDGYVLIAADVASARSVVVMKDDMDKIRPLDSHKLFAAAGPFTEHVQRDVRLYNMRSGITMSTAAVANYTRGELARFLRRSPFQCNVLIGGYDAPPFGQGASLYSCDYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSVEEGVEVIKKCVAEVQKRLVINQPRFAIKVVDKDGVR
>scaffold-VZWX-2020826-Ceramium_kondoi
MESIIALQGADYVLIAADVTSARSIVVMKRDADKIRTLDDNKLLAAAGVPGDVTKFVEHVQQDVSLYTLRSGIAMSTAAVAHYTRNELARFLRRSPFQCNVLLGGVDVAPNGSGPSLYSIDYLGTMAKLPFAVEGYAQYFLLGTMDRYWKKNMSLDDGLAVVRKCVDEIQQRLIINQPRFCIKVVTKDGVK
>scaffold-WEJN-2024405-Mazzaella_japonica
MESIIALQGRDYVLIAADMSSARSVVVMKDDMDKIRALDSHKLFAAAGTPGDVCKFSEHVQKDVRLYNMRSGITMSTAAVANYTRGELARLLRKGPYQCNVLIGGFDAAPHGTGPALYSCDYLGTLHRLSFAAEGYAQYFVLSTMDRYWRKGMGVDEGVGVVRRCIAEVQKRLVINQPRFVIKVVDKDGVR
>scaffold-XAXW-2022270-Neosiphonia_japonica
MESIIALQGDGYVLMAADASSARSIVVMKDDMDKIKSLDDQKLFAAAGVPGDVTKFTEHVQKDVRLYTLRSGITMSTAAVANYTRNELARFLRKSPFQCNVLLGGYDSAPNGEGPSLYSCDYLGTLAKLQFAAEGYAQYFVLSTMDRYWKKNLSLEDGLSVMKKCIAEIQKRLVISNPHFSIKVVTKDGIKEI
>scaffold-YSBD-2038985-Heterosiphonia_pulchra
MESIIALQGDGYVMIAADVSAARSIVVMKDDMDKIRALDDSKLFAAAGVPGDVTKFTEHVQKDVRLYTLRSGISMSTAALANYTRGELARLLRRSPFQCNVILGGYDAEPNGNGPSLYSCDYLGTLTKLTFAVEGYAQFFTLSTMDRYWKKNMSVDEGVDVIRKCIAEVQKRFLVNQPRFSIKMVSKDGVK
>scaffold-ZJOJ-2006011-Grateloupia_filicina
MVVVFGLTGNDFALVVADMTSARSIMCFKHDEDKIERIDERKVLATAGEHSNRIEFSEYIQKNLALMKLQTGLELSNHGTANFIRNEVAKALRTRGAYNTNSIMAGFDETGPAQKVNFTAHGYASYFSLSVMDSKWRQDMTLEEGKKLVQECIDQLKSRFLINQPKFMMKIVTDQGITE
>scaffold-ZJOJ-2006903-Grateloupia_filicina
MDTLLGIAGEGFVVLAADAQVARSILLYKNDMDKIAHLSENKALACAGPQSDCVSFTEYISKNMALYELNNDVKLSTKAAASFIRGELAKALRKGPFQTQILMGGVDKRAAAEAEGKDDASLFWLDYLGTLQKVPYGAHGYGAAFTLSVMDREYVKGLSLDEALAIIDNCIKELHTRFLIAQKNFVIKVVTAEGIK
>scaffold-ZJOJ-2055484-Grateloupia_filicina
MESIIALEGDGYVLIAADVASARSVVVMKDDMDKIRPLDSHKLFAAAGIPGDVSKFTEHVQKDVRLYNMRSGITMSTAAAANYTRGELARFLRRSPFQCNVLIGGYDAPPYGHGPSLYSCDYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSIEDGVEVIKKCVAEVQKRLVINQPRFAIKIVDKNGVRVID
>scaffold-ZULJ-2003903-Pyropia_yezoensis
MDSLIAISGRDFVLMASDVTSARSIVVMKEDMDKIMELDEHKLLGFAGEPGDCTAFTEYIQKNVHLFALRSGITLDTHAVGNFTRNELAVALRKRPYNANMLLAGYDEHVGPSLYYLDYLATLHKMDFSALGYASFFVLSTLDRHWKKNMSVDEALVVLKKCIKEVQTRMIISQPKFTIKLVGKDGIQVLEAASSVAD
>scaffold-ZULJ-2003905-Pyropia_yezoensis
MDSLIAISGRDFVLMASDVTSARSIVVMKEDMDKIMELDEHKLLGFAGEPGDCTAFTEYIQKNVHLFALRSGITLDTHAVGNFTRNELAVALRKRPYNANMLLAGYDEHVGPSLYYLDYLATLHKMDFSALGYASFFVLSTLDRHWKKNMSVDEALVVLKKCIKEVQTRMIISQPKFTIKLVGKDGIQVLEAASSVAD

So update on the problem: Got it half working. Here's my current code:

from Bio import SeqIO

sequence = SeqIO.parse("OG0000036.fa", "fasta")

Ahnfeltiopsis = open("Ahnfeltiopsis_flabelliformis.txt", "w")

Betaphycus = open("Betaphycus_philippinensis.txt", "w")


Ahnfeltiopsis_flabelliformis = 'Ahnfeltiopsis_flavelliformis'
Betaphycus_philippinensis = 'Betaphycus_philippinensis'


for x in sequence:
    if Betaphycus_philippinensis in x.id:
        SeqIO.write(x, Betaphycus, "fasta")
    elif Ahnfeltiopsis_flabelliformis in x.id:
        SeqIO.write(x, Ahnfeltiopsis, "fasta")
    else:
        continue

Ahnfeltiopsis.close()
Betaphycus.close()

As it stands, it places all the Betaphycus reads into the Betaphycus file, but skips over Ahnfeltiopsis. Not sure how to get past that. I've tried leaving the end blank and using else: continue. Not sure how to proceed from here.

Thanks


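You have a typo in 'Ahnfeltiopsis_flavelliformis': it should be 'Ahnfeltiopsis_flabelliformis' (with a "b"). Because the misspelled string never occurs in any record ID, the elif branch is never taken and the Ahnfeltiopsis file stays empty.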
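
One way to avoid hand-typed species names altogether is to read them from the species list mentioned in the question (one name per line). The following is only a minimal sketch under stated assumptions: `read_fasta` and `split_by_species` are hypothetical helper names, every record ID is assumed to end with `-<species>`, and a small plain-Python FASTA reader stands in for SeqIO so the sketch runs without Biopython installed.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file (plain-Python reader)."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_species(fasta_path, species_list_path, out_dir):
    """Write each record to <out_dir>/<species>.fa based on the end of its ID.

    Returns the headers that matched no listed species, so a misspelled
    name shows up as unmatched records instead of a silently empty file.
    """
    lines = Path(species_list_path).read_text().splitlines()
    species = [s.strip() for s in lines if s.strip()]
    # Check longer names first so 'X_y_low' is not mistaken for 'X_y'
    species.sort(key=len, reverse=True)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, unmatched = {}, []
    try:
        for header, seq in read_fasta(fasta_path):
            # The record ID is the header text up to the first whitespace
            record_id = (header.split() or [""])[0]
            name = next((s for s in species if record_id.endswith("-" + s)), None)
            if name is None:
                unmatched.append(header)
                continue
            if name not in handles:
                handles[name] = open(out_dir / f"{name}.fa", "w")
            handles[name].write(f">{header}\n{seq}\n")
    finally:
        # Close every output file even if an error interrupts the loop
        for fh in handles.values():
            fh.close()
    return unmatched
```

Calling `split_by_species("OG0000036.fa", "species.txt", "by_species")` (the last two names are placeholders) would write one file per listed species and return any headers that could not be assigned.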


Hi everyone,

Thanks heaps for the help. I managed to get some code working, but it was bloated. I've tried out SeqKit and it seems quite good.

a.zielezinski, your code was also really good, and worked perfectly. Thanks for that!

Lachlan

2.6 years ago

Here is Python code that reads one FASTA file and creates a separate FASTA file for each species.

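from Bio import SeqIO

d = {}  # species name -> open output file handle
with open('OG0000036.fa') as fh:
    for seq_record in SeqIO.parse(fh, 'fasta'):
        # The species name is the text after the last '-' in the record ID
        species_name = seq_record.id.split('-')[-1]
        if species_name not in d:
            d[species_name] = open(f"{species_name}.fa", 'w')
        d[species_name].write(seq_record.format("fasta"))
# Close the output files so that buffered data is flushed to disk
for out in d.values():
    out.close()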

OUTPUT:

FASTA files:

Ahnfeltiopsis_flabelliformis.fa
Betaphycus_philippinensis.fa
Ceramium_kondoi.fa
Chondrus_crispus.fa
Chroodactylon_ornatum.fa
Dumontia_simplex.fa
Eucheuma_denticulatum.fa
Glaucosphaera_vacuolata.fa
Gloiopeltis_furcata.fa
Gracilaria_blodgettii.fa
Gracilaria_lemaneiformis.fa
Gracilaria_sp..fa
Grateloupia_catenata.fa
Grateloupia_filicina.fa
Grateloupia_livida.fa
Grateloupia_turuturu.fa
Heterosiphonia_pulchra.fa
Kappaphycus_alvarezii.fa
Mazzaella_japonica.fa
Neosiphonia_japonica.fa
Porphyridium_cruentum.fa
Porphyridium_purpureum.fa
Pyropia_yezoensis.fa
Rhodochaete_parvula.fa

For example, Grateloupia_filicina.fa has three sequences:

>scaffold-ZJOJ-2006011-Grateloupia_filicina
MVVVFGLTGNDFALVVADMTSARSIMCFKHDEDKIERIDERKVLATAGEHSNRIEFSEYI
QKNLALMKLQTGLELSNHGTANFIRNEVAKALRTRGAYNTNSIMAGFDETGPAQKVNFTA
HGYASYFSLSVMDSKWRQDMTLEEGKKLVQECIDQLKSRFLINQPKFMMKIVTDQGITE
>scaffold-ZJOJ-2006903-Grateloupia_filicina
MDTLLGIAGEGFVVLAADAQVARSILLYKNDMDKIAHLSENKALACAGPQSDCVSFTEYI
SKNMALYELNNDVKLSTKAAASFIRGELAKALRKGPFQTQILMGGVDKRAAAEAEGKDDA
SLFWLDYLGTLQKVPYGAHGYGAAFTLSVMDREYVKGLSLDEALAIIDNCIKELHTRFLI
AQKNFVIKVVTAEGIK
>scaffold-ZJOJ-2055484-Grateloupia_filicina
MESIIALEGDGYVLIAADVASARSVVVMKDDMDKIRPLDSHKLFAAAGIPGDVSKFTEHV
QKDVRLYNMRSGITMSTAAAANYTRGELARFLRRSPFQCNVLIGGYDAPPYGHGPSLYSC
DYLGTLTKLKFAAEGYAQYFVLSTLDRYWKKNLSIEDGVEVIKKCVAEVQKRLVINQPRF
AIKIVDKNGVRVID

Would this code work well in instances where there are multiple instances of the same species? For instance later I might have Kappaphycus_alvarezii_std, Kappaphycus_alvarezii_low, etc. Would it still work like that?

I'm guessing the seq_record.id.split('-')[-1] command splits the sections based on the - character, and takes the final section? Is that correct?
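
A quick way to check that reading of the expression is to run it on a few IDs. This is just an illustrative sketch: `species_from_id` is a hypothetical helper name, and the two `Kappaphycus_alvarezii_*` IDs are made up to mirror the naming described above.

```python
def species_from_id(record_id):
    """Return the text after the last '-' in a record ID."""
    return record_id.split("-")[-1]


ids = [
    "scaffold-ZJOJ-2006011-Grateloupia_filicina",
    "scaffold-PWKQ-2004203-Gracilaria_sp.",
    "scaffold-IHJY-2053086-Kappaphycus_alvarezii_std",  # hypothetical ID
    "scaffold-IHJY-2053086-Kappaphycus_alvarezii_low",  # hypothetical ID
]
for record_id in ids:
    # Show the pieces produced by split('-') and the piece that is kept
    print(record_id.split("-"), "->", species_from_id(record_id))
```

Because only the last piece is kept, the `_std` and `_low` labels stay distinct. One caveat: a species label that itself contained a hyphen would be cut at that hyphen.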


You are correct - the code will create two separate files for Kappaphycus_alvarezii_std and Kappaphycus_alvarezii_low. However, I can modify the code to get only the first two words from the full species name. For example, Kappaphycus_alvarezii_std and Kappaphycus_alvarezii_low would end up as one species Kappaphycus_alvarezii, and all their sequences will be saved to the Kappaphycus_alvarezii.fa file. Would that be ok?
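
The two-word variant described here could look roughly like the sketch below. It is not the code from the answer: `binomial_name` and `output_name` are hypothetical helper names, and it assumes the species label is always `Genus_species` optionally followed by `_suffix`.

```python
def binomial_name(species_label):
    """Keep only the first two '_'-separated words (genus and species)."""
    return "_".join(species_label.split("_")[:2])


def output_name(record_id):
    """Output file name for a record ID of the form '...-Genus_species[_suffix]'."""
    return binomial_name(record_id.split("-")[-1]) + ".fa"


# Both hypothetical IDs map to the same output file
print(output_name("scaffold-IHJY-2053086-Kappaphycus_alvarezii_std"))
print(output_name("scaffold-IHJY-2053086-Kappaphycus_alvarezii_low"))
```

Labels that already have exactly two words, such as `Gracilaria_sp.`, pass through unchanged.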


It's alright, thanks. I want to keep them separate. Thank you for the offer though.

2.6 years ago

with seqkit:

$ seqkit -w 0 split -i --id-regexp '.*-(\w+_\w+)\.*$' test.fa -O new_files

Assuming that all fasta headers follow the same pattern, run the command above. The new files will be in the "new_files" folder (you can change the name). Each new fasta file name starts with the input name, followed by the species name (input.id_Pyropia_yezoensis.fasta), and each file contains all the sequences for that species (taken from the header).
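
To see what the `--id-regexp` pattern captures before running it on real data, the same pattern can be tried in Python. This is a hedged sketch: `captured_species` is a hypothetical helper name, the `Kappaphycus_alvarezii_std` ID is made up, and it assumes Python's `re` module and seqkit's Go regular expressions behave the same for this simple pattern.

```python
import re

# Same pattern as in the seqkit command above
ID_REGEXP = re.compile(r".*-(\w+_\w+)\.*$")


def captured_species(record_id):
    """Return the capture group for a record ID, or None if it does not match."""
    match = ID_REGEXP.match(record_id)
    return match.group(1) if match else None


for record_id in [
    "scaffold-ZULJ-2003903-Pyropia_yezoensis",
    "scaffold-PWKQ-2004203-Gracilaria_sp.",
    "scaffold-IHJY-2053086-Kappaphycus_alvarezii_std",  # hypothetical ID
]:
    print(record_id, "->", captured_species(record_id))
```

Two details are worth noting: the trailing `\.*` leaves the final dot of `Gracilaria_sp.` outside the capture group, and because `\w` also matches underscores, a label with a third word is captured whole.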

with awk and flattened fasta sequences (sequences in single line for each record):

$ awk -F '[-]' '/>/{getline seq; print $0"\n"seq > "new_folder/"$4".fa"}' input.fa

Create a folder named new_folder in the same directory as input.fa before running the command. All the new fasta files will appear in new_folder.

Back up your fasta file before you run the command.
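
The awk one-liner reads exactly one sequence line per header, so a fasta file with wrapped sequences has to be flattened first. A minimal plain-Python sketch of that step, assuming a hypothetical helper name `flatten_fasta` and placeholder file names:

```python
def flatten_fasta(in_path, out_path):
    """Rewrite a FASTA file so that each sequence occupies exactly one line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        first_record = True
        for line in src:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # End the previous sequence line before starting a new header
                if not first_record:
                    dst.write("\n")
                dst.write(line + "\n")
                first_record = False
            elif line:
                # Append sequence text without a newline to join wrapped lines
                dst.write(line)
        # Terminate the last sequence line
        if not first_record:
            dst.write("\n")
    return out_path
```

For example, `flatten_fasta("wrapped.fa", "input.fa")` would produce a file in which every header is followed by a single sequence line, which is the layout the awk command expects.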

fasta stats from awk split:

file                             format  type     num_seqs  sum_len  min_len  avg_len  max_len
Ahnfeltiopsis_flabelliformis.fa  FASTA   Protein         1      189      189      189      189
Betaphycus_philippinensis.fa     FASTA   Protein         1      191      191      191      191
Ceramium_kondoi.fa               FASTA   Protein         1      191      191      191      191
Chondrus_crispus.fa              FASTA   Protein         1      191      191      191      191
Chroodactylon_ornatum.fa         FASTA   Protein         1      219      219      219      219
Dumontia_simplex.fa              FASTA   Protein         1      189      189      189      189
Eucheuma_denticulatum.fa         FASTA   Protein         1      191      191      191      191
Glaucosphaera_vacuolata.fa       FASTA   Protein         1      179      179      179      179
Gloiopeltis_furcata.fa           FASTA   Protein         1      190      190      190      190
Gracilaria_blodgettii.fa         FASTA   Protein         2      346      157      173      189
Gracilaria_lemaneiformis.fa      FASTA   Protein         1      191      191      191      191
Gracilaria_sp..fa                FASTA   Protein         1      189      189      189      189
Grateloupia_catenata.fa          FASTA   Protein         2      382      191      191      191
Grateloupia_filicina.fa          FASTA   Protein         3      569      179    189.7      196
Grateloupia_livida.fa            FASTA   Protein         1      194      194      194      194
Grateloupia_turuturu.fa          FASTA   Protein         2      370      185      185      185
Heterosiphonia_pulchra.fa        FASTA   Protein         1      191      191      191      191
Kappaphycus_alvarezii.fa         FASTA   Protein         1      157      157      157      157
Mazzaella_japonica.fa            FASTA   Protein         1      191      191      191      191
Neosiphonia_japonica.fa          FASTA   Protein         1      193      193      193      193
Porphyridium_cruentum.fa         FASTA   Protein         1      183      183      183      183
Porphyridium_purpureum.fa        FASTA   Protein         1      129      129      129      129
Pyropia_yezoensis.fa             FASTA   Protein         2      287       89    143.5      198
Rhodochaete_parvula.fa           FASTA   Protein         1      188      188      188      188