Hello all,
Previously, I posted about a question in a similar vein (see here) BUT now, 2 weeks later, I think I am nearly there! I plan to update that previous post and explain what I've done once I've tackled this final bit. (TL;DR my other question: I used the hittable, not the FASTA headers which I should've realised ages ago)
The problem:
I have a multifasta file with all the sequences that I have identified as overlapping. These results are grouped by GenBank Accession number and nucleotide positon:
>AK310930|1:38-236_Homo_sapiens
ATGAAGGCTCTCATTGTTCTGGGG
>AK310930|1:231-384_Homo_sapiens
CTGCAGTGCTTTGCTGCAAG
>XM_010841625|1:145-445_PREDICTED:_Bison_bison
ATGAA
>XM_010841625|1:444-512_PREDICTED:_Bison_bison
TGGGT
I have seperate these entries into their own seperate files (thanks Pierre!) which are just simply called _1.fasta, _2.fasta ect.
Using the merge function from EMBOSS does work and I am delighted to have found something that does the job I'm after. The catch is, manually adding your entries in takes time and there is a real chance I am staring at upwards of 1000+ files I'll have to use merger on.
How could I write a loop, suitable for someone on a macOS, that could run merge? Is that even possible? It took a noticeable amount of time for it to stitch two of these sequences together and I am worried about accidentally frying my MacBook (which is technically the unis!)!
Someone used perl to get a different EMBOSS function to work and it does look like it might be feesible but I really don't have any knowledge of perl and have never used it!
Would something like this do the job?:
for file in *.fasta; do merger file1.seq file2.seq -sreverse2 -outseq merged.seq "$file"; done
Thank you kindly in advance, I'm trying to understand if this is feasible and if I'm on the right path!
Please use
101010
button to properly format relevant sections. I have done it for you this time.Thank you Geno, I appreciate that and thank you for your patience!
What you are describing sounds pretty close to attempting to assemble sequences into contigs. Should we possibly better approach this as a general, possibly reference guided sequence assembly problem? The point with merger is that it computes optimal global alignments and is therefore slow. It will only get slower, the longer the sequences get. If you know already that the alignment sequence is like that seq_1 overlaps only seq_2, and seq_n overlaps only seq_n+1, than you could in principle apply it like: merge s1 s2 > assembly.fast, merge assembly.fasta s3, merge assembly.fasta s4, ... but that would become increasingly slower because a full NW alignment will be performed repeatedly, unnecessarily in my opinion. If you instead do not know which fragments overlap, they cannot be "assembled" that way at all.
Also it seems that you now the location of the fragments in their reference sequences already, so it might be much faster to simply align them to the reference or even simply join them based on the known positions.
@Michael problem is if you look at the examples above the overlap can be really small (down to one bp).
Leah : Do you always have two sequences per accession as shown above?
for
loop you describe would indeed do the job and should not fry your macbook. You can't simply dofor file in *.fasta
though. You should identify unique part of the two files and then use that instead. What do the two file names look like forAK310930
above?You are right, I also noticed that the examples as given do not really overlap, but then I am guessing this is because of the examples being shortend. Following the fasta headers, genomic or transcript coordinates are known:
So why not simply stitch the sequences together based on those ranges? Alignment-based merging or assembly is not going to work with so short overlaps.
What I would do:
This can be done with many different tools, even with EMBOSS tools and some scripting.
That is what I was thinking. It would be simpler to find outside co-ordinate bounds and then extract the sequence from those accessions.
Hi both,
Thank you so much for your input so far and I'm glad to hear I'm not completely barking mad trying to do this...! To answer your early question GenoMax , yes, so far, it's all been paired sequences. Names for the sequences are (at the minute) _1.fasta and _2.fasta. Simple but not too informative. I also have a larger file (simply called over.fasta) which has all the different overlapping files in it, sorted into "natural" order (ie by Accession number AND nucleotide position)
However, one of my supervisors other PhD student actually whipped up a python script based on what you've both been saying - it takes the numbers, figures out the difference (+1) and then puts the sequences together using that information. I've checked it and it does work! I can check with him about if he's cool with me sharing it here? Although I am interested to hear how you might attempt it in R, which I'm less familiar with BUT am inevitably going to have to learn.
I'll wait for your responses, thank you again so much for your suggestions, I can only admire your solutions!