Is it possible to use a loop to get the EMBOSS Merger function to work on multiple FASTA files?
1
0
Entering edit mode
3 months ago
Leah ▴ 10

Hello all,

Previously, I posted about a question in a similar vein (see here) BUT now, 2 weeks later, I think I am nearly there! I plan to update that previous post and explain what I've done once I've tackled this final bit. (TL;DR my other question: I used the hittable, not the FASTA headers which I should've realised ages ago)

The problem:

I have a multifasta file with all the sequences that I have identified as overlapping. These results are grouped by GenBank Accession number and nucleotide positon:

>AK310930|1:38-236_Homo_sapiens
ATGAAGGCTCTCATTGTTCTGGGG

>AK310930|1:231-384_Homo_sapiens
CTGCAGTGCTTTGCTGCAAG

>XM_010841625|1:145-445_PREDICTED:_Bison_bison
ATGAA

>XM_010841625|1:444-512_PREDICTED:_Bison_bison
TGGGT


I have seperate these entries into their own seperate files (thanks Pierre!) which are just simply called _1.fasta, _2.fasta ect.

Using the merge function from EMBOSS does work and I am delighted to have found something that does the job I'm after. The catch is, manually adding your entries in takes time and there is a real chance I am staring at upwards of 1000+ files I'll have to use merger on.

How could I write a loop, suitable for someone on a macOS, that could run merge? Is that even possible? It took a noticeable amount of time for it to stitch two of these sequences together and I am worried about accidentally frying my MacBook (which is technically the unis!)!

Someone used perl to get a different EMBOSS function to work and it does look like it might be feesible but I really don't have any knowledge of perl and have never used it!

Would something like this do the job?:

for file in *.fasta; do merger file1.seq file2.seq -sreverse2 -outseq merged.seq "\$file"; done


Thank you kindly in advance, I'm trying to understand if this is feasible and if I'm on the right path!

nucleotides merger FASTA EMBOSS • 889 views
1
Entering edit mode

Please use 101010 button to properly format relevant sections. I have done it for you this time.

0
Entering edit mode

Thank you Geno, I appreciate that and thank you for your patience!

1
Entering edit mode

What you are describing sounds pretty close to attempting to assemble sequences into contigs. Should we possibly better approach this as a general, possibly reference guided sequence assembly problem? The point with merger is that it computes optimal global alignments and is therefore slow. It will only get slower, the longer the sequences get. If you know already that the alignment sequence is like that seq_1 overlaps only seq_2, and seq_n overlaps only seq_n+1, than you could in principle apply it like: merge s1 s2 > assembly.fast, merge assembly.fasta s3, merge assembly.fasta s4, ... but that would become increasingly slower because a full NW alignment will be performed repeatedly, unnecessarily in my opinion. If you instead do not know which fragments overlap, they cannot be "assembled" that way at all.

Also it seems that you now the location of the fragments in their reference sequences already, so it might be much faster to simply align them to the reference or even simply join them based on the known positions.

1
Entering edit mode

@Michael problem is if you look at the examples above the overlap can be really small (down to one bp).

Leah : Do you always have two sequences per accession as shown above? for loop you describe would indeed do the job and should not fry your macbook. You can't simply do for file in *.fasta though. You should identify unique part of the two files and then use that instead. What do the two file names look like for AK310930 above?

1
Entering edit mode

You are right, I also noticed that the examples as given do not really overlap, but then I am guessing this is because of the examples being shortend. Following the fasta headers, genomic or transcript coordinates are known:

 >AK310930|1:*38-236*_Homo_sapiens


So why not simply stitch the sequences together based on those ranges? Alignment-based merging or assembly is not going to work with so short overlaps.

What I would do:

• Extract all intervals and sequence identifiers and load into a GenomicRanges object in R
• Normalize all the intervals to make continuous intervals
• Load the DNA-sequences into R and extract the continuous ranges using the GRanges

This can be done with many different tools, even with EMBOSS tools and some scripting.

1
Entering edit mode

So why not simply stitch the sequences together based on those ranges?

That is what I was thinking. It would be simpler to find outside co-ordinate bounds and then extract the sequence from those accessions.

1
Entering edit mode

Hi both,

Thank you so much for your input so far and I'm glad to hear I'm not completely barking mad trying to do this...! To answer your early question GenoMax , yes, so far, it's all been paired sequences. Names for the sequences are (at the minute) _1.fasta and _2.fasta. Simple but not too informative. I also have a larger file (simply called over.fasta) which has all the different overlapping files in it, sorted into "natural" order (ie by Accession number AND nucleotide position)

However, one of my supervisors other PhD student actually whipped up a python script based on what you've both been saying - it takes the numbers, figures out the difference (+1) and then puts the sequences together using that information. I've checked it and it does work! I can check with him about if he's cool with me sharing it here? Although I am interested to hear how you might attempt it in R, which I'm less familiar with BUT am inevitably going to have to learn.

0
Entering edit mode
3 months ago

Here is some example code how to do this in R/BioConductor. This answer is based on what was found out during our discussion. Just in case someone wonders how this is at all an answer to original question, please read the comments.

As you can see, the core operations are essentially a two-liner:

rgr <- reduce(gr) # This is why I chose GRanges
res <- seqs[rgr]


The rest of the code is simply there to fetch reproducible example data. For the sake of the example, I didn't bother dealing with FASTA files and parsing of intervals. Instead, the ranges are defined inline, and the sequences are fetched from GenBank directly. Additional parsing steps can be added if required.

### This is a self-contained example (except for package installation)
library(GenomicRanges)
library (genbankr)
### define genomic ranges, these could be parsed from a GFF3 file
gr <- GRanges(seqnames = Rle(c("AK310930", "AK310930", "XM_010841625", "XM_010841625")), seqlengths = NULL, ranges = IRanges(start = c(38, 231, 145, 444),  end = c(236, 384, 445, 512)))
gr
### This is core step 1 of the operation: join the intervals into a reduced GRanges object with coverage 0 or 1
rgr <- reduce(gr)
rgr

# fetch example sequences directly from genbank using the genbank id
seqfetch.1 <- function(gbkid)
# readGenBank fetches only one seq at a time
seqs <- DNAStringSet(sapply(unique(seqnames(gr)), seqfetch.1))
names(seqs) <- unique(seqnames(gr))
seqs

### This is core step 2 of the operation: subset the sequences by the GRanges object
res <- seqs[rgr]
names(res) <- seqnames(rgr) # names get lost in the operation, so re-attach names
### That's it
res


Output:

GRanges object with 4 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]     AK310930    38-236      *
[2]     AK310930   231-384      *
[3] XM_010841625   145-445      *
[4] XM_010841625   444-512      *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
GRanges object with 2 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]     AK310930    38-384      *
[2] XM_010841625   145-512      *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
No exons read from genbank file. Assuming sections of CDS are full exons
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
No exons read from genbank file. Assuming sections of CDS are full exons
No transcript features (mRNA) found, using spans of CDSs
DNAStringSet object of length 2:
width seq                                          names
[1]  1043 GGCCTCTGGAGCCTAGCACTC...AGTCAGAGGTGGCGAAACCC AK310930
[2]   570 TGCCAAAGGGGAAGATTTTGG...TCATATTAAGGAAACGATAG XM_010841625
DNAStringSet object of length 2:
width seq                                          names
[1]   347 ATGAAGGCTCTCATTGTTCTG...TTCAAGGTTGTGGAGTGTAA AK310930
[2]   368 ATGAAGGCTCTCCTTATTCTG...TTCAGGGTTGCGGAGTATAA XM_010841625