Hi, first post here.
So I'm trying take the CDS out of various species' orthologous sequences. I'm running on a Linux server, and am mainly aiming to use BioPython or Linux programs for this.
I've run OrthoFinder on 28 species of seaweed, which gave out roughly 10,000 orthogroup sequences fasta files, each of which is a a multi-fasta file. I've concatenated each of them into one huge multifasta file, and now I want to extract the fasta files according to their species into a new multifasta file (so 10k files -> 1 file -> 28 files, one per species).
How do I do this? I'm still fairly new to BioPython, so I'm still wrapping my head around things. I know I'll definitely need SeqIO, not sure what other libraries I'll need. I already have a text file with all the species listed, one per line.
Hi, thanks for the reply.
What I would want to do with that is sort each fasta by species name, then output it so that each fasta sequence for each species is outputted to its own fasta file. Note that there will be multiple sequences of various lengths for each species (I'm talking tens of thousands of sequences all up in the end).
Thanks.
Here's an example part:
Edit: just noticed Biostars doesn't like copy-pasting fasta file. Idk how to fix that.
So update on the problem:
Got it half working. Here's my current code:
from Bio import SeqIO
sequence = SeqIO.parse("OG0000036.fa", "fasta")
Ahnfeltiopsis = open("Ahnfeltiopsis_flabelliformis.txt", "w")
Betaphycus = open("Betaphycus_philippinensis.txt", "w")
Ahnfeltiopsis_flabelliformis = 'Ahnfeltiopsis_flavelliformis'
Betaphycus_philippinensis = 'Betaphycus_philippinensis'
for x in sequence:
if Betaphycus_philippinensis in x.id:
SeqIO.write(x, Betaphycus, "fasta")
elif Ahnfeltiopsis_flabelliformis in x.id:
SeqIO.write(x, Ahnfeltiopsis, "fasta")
else:
continue
Ahnfeltiopsis.close()
Betaphycus.close()
As it stands, it places all the Betaphycus reads into the Betaphycus file, but skips over Ahnfeltiopsis. Not sure how to get past that. I've tried leaving the end blank and using else: continue. Not sure how to proceed from here.
Here is a Python code that reads one FASTA file and creates multiple FASTA files for each species separately.
from Bio import SeqIO
d = {}
fh = open('OG0000036.fa')
for seq_record in SeqIO.parse(fh, 'fasta'):
species_name = seq_record.id.split('-')[-1]
if species_name not in d:
d[species_name] = open(f"{species_name}.fa", 'w')
d[species_name].write(seq_record.format("fasta"))
fh.close()
Would this code work well in instances where there are multiple instances of the same species? For instance later I might have Kappaphycus_alvarezii_std, Kappaphycus_alvarezii_low, etc. Would it still work like that?
I'm guessing the seq_record.id.split('-')[-1] command splits the sections based on the - character, and takes the final section? Is that correct?
You are correct - the code will create two separate files for Kappaphycus_alvarezii_std and Kappaphycus_alvarezii_low. However, I can modify the code to get only the first two words from the full species name. For example, Kappaphycus_alvarezii_std and Kappaphycus_alvarezii_low would end up as one species Kappaphycus_alvarezii, and all their sequences will be saved to the Kappaphycus_alvarezii.fa file. Would that be ok?
Assuming that all fasta headers follow same pattern, please run above code. New files would be in "new_files" folder (you can change it) and each new fasta file starts with input_id, followed by species name (input.id_Pyropia_yezoensis.fasta) and within in each fasta, you would find all sequences pertaining to that species (from header).
with awk and flattened fasta sequences (sequences in single line for each record):
Create a folder by name new_folder in the same directory as input.fabefore code execution . All new sequences will appear in new_folder.
Take a back up of your fasta file before you execute the code
fasta stats from awk split:
file format type num_seqs sum_len min_len avg_len max_len
Ahnfeltiopsis_flabelliformis.fa FASTA Protein 1 189 189 189 189
Betaphycus_philippinensis.fa FASTA Protein 1 191 191 191 191
Ceramium_kondoi.fa FASTA Protein 1 191 191 191 191
Chondrus_crispus.fa FASTA Protein 1 191 191 191 191
Chroodactylon_ornatum.fa FASTA Protein 1 219 219 219 219
Dumontia_simplex.fa FASTA Protein 1 189 189 189 189
Eucheuma_denticulatum.fa FASTA Protein 1 191 191 191 191
Glaucosphaera_vacuolata.fa FASTA Protein 1 179 179 179 179
Gloiopeltis_furcata.fa FASTA Protein 1 190 190 190 190
Gracilaria_blodgettii.fa FASTA Protein 2 346 157 173 189
Gracilaria_lemaneiformis.fa FASTA Protein 1 191 191 191 191
Gracilaria_sp..fa FASTA Protein 1 189 189 189 189
Grateloupia_catenata.fa FASTA Protein 2 382 191 191 191
Grateloupia_filicina.fa FASTA Protein 3 569 179 189.7 196
Grateloupia_livida.fa FASTA Protein 1 194 194 194 194
Grateloupia_turuturu.fa FASTA Protein 2 370 185 185 185
Heterosiphonia_pulchra.fa FASTA Protein 1 191 191 191 191
Kappaphycus_alvarezii.fa FASTA Protein 1 157 157 157 157
Mazzaella_japonica.fa FASTA Protein 1 191 191 191 191
Neosiphonia_japonica.fa FASTA Protein 1 193 193 193 193
Porphyridium_cruentum.fa FASTA Protein 1 183 183 183 183
Porphyridium_purpureum.fa FASTA Protein 1 129 129 129 129
Pyropia_yezoensis.fa FASTA Protein 2 287 89 143.5 198
Rhodochaete_parvula.fa FASTA Protein 1 188 188 188 188
Thank you for explaining the issue in detail. It would help better if you could post an example input file and expected outcome file.
Hi, thanks for the reply. What I would want to do with that is sort each fasta by species name, then output it so that each fasta sequence for each species is outputted to its own fasta file. Note that there will be multiple sequences of various lengths for each species (I'm talking tens of thousands of sequences all up in the end).
Thanks.
Here's an example part: Edit: just noticed Biostars doesn't like copy-pasting fasta file. Idk how to fix that.
So update on the problem: Got it half working. Here's my current code:
As it stands, it places all the Betaphycus reads into the Betaphycus file, but skips over Ahnfeltiopsis. Not sure how to get past that. I've tried leaving the end blank and using else: continue. Not sure how to proceed from here.
Thanks
You have a typo in
'Ahnfeltiopsis_flavelliformis'
.Hi everyone,
Thanks heaps for the help. I managed to get some code working, but was bloated. I've tried out SeqKit and it seems quite good.
a.zielezinski, your code was also really good, and worked perfect. Thanks for that!
Lachlan