Previously, I posted a question in a similar vein (see here), but now, 2 weeks later, I think I am nearly there! I plan to update that previous post and explain what I've done once I've tackled this final bit. (TL;DR for my other question: I used the hit table, not the FASTA headers, which I should've realised ages ago.)
I have a multi-FASTA file with all the sequences that I have identified as overlapping. These results are grouped by GenBank accession number and nucleotide position:
>AK310930|1:38-236_Homo_sapiens
ATGAAGGCTCTCATTGTTCTGGGG
>AK310930|1:231-384_Homo_sapiens
CTGCAGTGCTTTGCTGCAAG
>XM_010841625|1:145-445_PREDICTED:_Bison_bison
ATGAA
>XM_010841625|1:444-512_PREDICTED:_Bison_bison
TGGGT
I have separated these entries into their own separate files (thanks Pierre!), which are simply called _1.fasta, _2.fasta, etc.
Using the merger program from EMBOSS does work, and I am delighted to have found something that does the job I'm after. The catch is that manually entering each pair of sequences takes time, and there is a real chance I am looking at 1000+ files that I'll have to run merger on.
How could I write a loop, suitable for someone on macOS, that runs merger? Is that even possible? It took a noticeable amount of time to stitch two of these sequences together, and I am worried about accidentally frying my MacBook (which is technically the uni's!).
Someone used Perl to get a different EMBOSS program to work, and it does look like it might be feasible, but I really don't have any knowledge of Perl and have never used it!
Would something like this do the job?
for file in *.fasta; do merger file1.seq file2.seq -sreverse2 -outseq merged.seq "$file"; done
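As written, the loop above would hand merger the same file1.seq and file2.seq on every pass, with "$file" tacked on as an extra argument. Here is a minimal sketch of what a pairwise loop might look like instead. It assumes the files pair up consecutively (_1.fasta with _2.fasta, _3.fasta with _4.fasta, and so on), which may not match how the files were actually split. The function name merge_pairs and the merged_* output names are made up for illustration; -sreverse2 is carried over from the attempt above and may not be needed if both sequences are on the same strand.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: _1.fasta pairs with _2.fasta, _3.fasta with
# _4.fasta, etc. Change the pairing logic if your files are grouped differently.

merge_pairs() {
    dir="${1:-.}"   # folder holding the _N.fasta files (default: current folder)
    i=1
    # Keep going while both files of the next pair exist.
    while [ -f "$dir/_${i}.fasta" ] && [ -f "$dir/_$((i + 1)).fasta" ]; do
        a="$dir/_${i}.fasta"
        b="$dir/_$((i + 1)).fasta"
        out="$dir/merged_${i}_$((i + 1))"   # hypothetical output naming
        # -auto stops EMBOSS from prompting for missing values;
        # -sreverse2 reverse-complements the second sequence.
        merger -asequence "$a" -bsequence "$b" -sreverse2 \
               -outseq "$out.fasta" -outfile "$out.merger" -auto \
            || echo "merger failed for $a and $b" >&2
        i=$((i + 2))
    done
}
```

After pasting this into Terminal (or sourcing it from a file), something like `merge_pairs /path/to/fasta/folder` would run merger once per pair, one pair at a time, so it should not load the machine any more than a single manual run does; it just takes longer overall.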
Thank you kindly in advance, I'm trying to understand if this is feasible and if I'm on the right path!