Question: percent identities for all by all protein alignment
0
gravatar for bioguy
12 months ago by
bioguy50
bioguy50 wrote:

Very simple question – I have a list of, say, 1000 protein sequences in a fasta file. I want to compute percent identities for each pair of sequences. So, in this example, it'd generated a 1000 by 1000 matrix reflected over the diagonal.

I've tried diamond and blastp, but even when I set -max-target-seqs to 0 or set the evalue/min percent identity to appropriately high/low values, they still don't report all possible alignments. There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

Any recommendations for how to get this done quickly? I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

Thanks in advance.

ADD COMMENTlink modified 12 months ago by fishgolden420 • written 12 months ago by bioguy50
2
gravatar for fishgolden
12 months ago by
fishgolden420
fishgolden420 wrote:

There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

I think this is really difficult task since the reason why those programs (at least BLAST) are fast is that they have some filtering steps.

In my opinion, one of the critical filters still remaining in your protocol with BLAST is that the "word size" & I don't know about DIAMOND very much, sorry.

Anyway, I recommend water or needle in EMBOSS package http://emboss.sourceforge.net/ . Or SSEARCH, or GGSEARCH in FASTA package https://github.com/wrpearson/fasta36 for your purpose.

ADD COMMENTlink written 12 months ago by fishgolden420
0
gravatar for Joe
12 months ago by
Joe16k
United Kingdom
Joe16k wrote:

Neither DIAMOND nor BLAST are the tool for this really.

Just iterate over your file in an exhaustive pairwise approach, and use a normal aligner like MAFFT or CLUSTAL which can give you a score.

I'd advise looking at ways to parallelize this as you have an n choose k == 499500 problem here.

ADD COMMENTlink written 12 months ago by Joe16k
1

Thank you – ended up going the CLUSTAL route with some basic multi-threading to speed things up. Very much appreciate your help!

ADD REPLYlink written 12 months ago by bioguy50

No problem ;)

ADD REPLYlink written 12 months ago by Joe16k

CLUSTAL and MAFFT are Multiple Sequence Alignment software that they perform stringent DP (edit: jrj.healey said CLUSTAL's alignment is a bit janky Its alignment may not be stringent one...) same as water and needle when they are used as pairwise aligner. I mean, I think CLUSTAL and MAFFT are over spec. But it is ok because the results would be the same when the same parameters were used, i guess.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420
1

Clustal begins by performing all vs all pairwise alignments anyway, but other threads on the site have shown that its pairwise alignments can be a bit janky. Using needle or water directly is likely a good approach too. In my experience, MAFFT gives better DNA alignments than does CLUSTAL anyway.

(Those were just the first 2 to come to mind, there are many other aligners to experiment with :) )

ADD REPLYlink written 12 months ago by Joe16k

but other threads on the site have shown that its pairwise alignments can be a bit janky.

Oh, really. I didn't know that. Thank you for the comment.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

I'm mainly referring to this one: what is the problem with using clustal to do pairwise alignment?

But its far from a systematic evaluation, so YMMV.

ADD REPLYlink written 12 months ago by Joe16k

In Clustal Omega, the alignments are then computed using the very accurate HHalign package (Söding, 2005), which aligns two profile hidden Markov models (Eddy, 1998).

http://msb.embopress.org/content/7/1/539

It seems that Clustal Omega uses HHalign.

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

ClustalW is using DP by default.

http://www.clustal.org/download/clustalw_help.txt

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

Ah ok, thank you. Will give the others you recommended a try as well if need be. Really appreciate all the help!

ADD REPLYlink written 12 months ago by bioguy50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1556 users visited in the last hour