fast global amino acid alignment without db creation
2
0
Entering edit mode
8.1 years ago
Leszek 4.1k

I'm looking for the global amino acid alignment functions/program that can be easily wrapped into Python. I operate on single query and set of up to a few hundred targets to align. So far I've been using BLAT. It's very fast, but both query and targets need to be saved into files, and output is again served as a file.

Do you know of any alternative solution, preferably reading from stdin/writing to stdout, that is as fast as BLAT and can be easily wrapped into Python? I don't care that much about sensitivity, but speed is very important, thus native biopython pairwise2 is too slow.

I have explored some implementations, but all required database creation, which of course I prefer to avoid.

Handling nucleotide alignments is a big pros, but not mandatory.

EDIT

I'm not looking for sequence similarity search tool, but for global alignment tool/library that can be easily wrapped into Python.

alignment python protein • 2.4k views
0
Entering edit mode

BLAT is not really for global alignment and it lacks sensitivity. For a few hundreds rounds of alignment, just use needle from EMBOSS.

0
Entering edit mode
8.1 years ago
xbello ▴ 20

I don't know if this will fit your needs, but Biopython has parser for BLAT outputs. This way you can automate all the BLAT pipeline, wrapping it all with python/Biopython.

I also don't know why _of course_ you prefer to avoid the database creation. The key for speeding the searches is the indexing of the sequences, and AFAIK this involves database creation, at least in a NoSQL fashion (key: sequence). No indexing, no speed. The classical BLAST itself runs quite faster when you index the subject sequences, if the subject is huge.

0
Entering edit mode

Thanks, I didn't know about BlatIO. That may be useful, but tab-delimited output is quite handy anyway.

Simply, I start with pre-selected targets, so no need for similarity search whatsoever. I just want to align them to the query in order to know identity/overlap and get pairwise alignment.

0
Entering edit mode

Whenever I have to do these things, I:

1. Save the query and the subject in NamedTemporaryFile.
2. Launch a BLAST, BLAT, MUSCLE, whatever... through subprocess.Popen().
3. Catch the stdout/stderr and output files. Biopython can load almost all format output with AlignIO.
4. Delete the intermediates.

This way you get the speed in step #2 and keep the logic together. E.g. this is a code I'm actually using, to align thousands of pairs of sequences one to each other, passed as pair = {"key": "Bio.Seq()", "key2": "Bio.Seq()"}:

def align(pair):
"""Return a string with the file name to the alignment between a pair."""
# Save the matches to temporary files
tmp_align = NamedTemporaryFile()
align_file = tmp_align.name + ".al.fasta"
SeqIO.write(pair.values(), tmp_align, "fasta")
# Rewind the file so Muscle can read it
tmp_align.seek(0)

# Align the matches
proc = subprocess.Popen(
[c.MUSCLE,  # This is the path to Muscle binary
"-in", tmp_align.name,
"-out", align_file],
stderr=subprocess.PIPE)
_, err = proc.communicate()
tmp_align.close()
# Check if Muscle failed
if "ERROR" in err:
raise IOError(err)
return align_file

0
Entering edit mode

Actually, this is what I was doing so far. As stated in my post:

1. I want to avoid creating files (event temporary ones!)
2. I prefer programs or libraries (maybe some C/C++ libs?) that can align sequences fast, and can avoid building db

This said, I'm not looking for sequence similarity search tool, but for global alignment tool.

0
Entering edit mode
8.1 years ago
hpmcwill ★ 1.2k

Since you want global pairwise sequence alignment then you will want to look at:

As command-line tools these both support one of the two sequences to be aligned coming from STDIN. BioPython includes support for EMBOSS needle and it shouldn't be too difficult to extend this to EMBOSS stretcher.

Since you want to wrap in Python, you could look into accessing these as code libraries, EMBOSS is structured as a set of libraries with a thin layer of code to provide the command-line tools, and so should be ideal for this type of usage. That said the FASTA suite includes accelerated implementations of the methods, and so would be desirable, however the implementation may be more difficult to wrap as anything by an external process.

Another option you might want to look at, although I am unclear on its current status, would be BioLib, which promised to provide wrappers to common bioinformatics C/C++ libraries for the Bio* projects (BioPerl, BioPython, etc.).