Question: percent identities for all by all protein alignment
0
gravatar for bioguy
8 days ago by
bioguy30
bioguy30 wrote:

Very simple question – I have a list of, say, 1000 protein sequences in a fasta file. I want to compute percent identities for each pair of sequences. So, in this example, it'd generated a 1000 by 1000 matrix reflected over the diagonal.

I've tried diamond and blastp, but even when I set -max-target-seqs to 0 or set the evalue/min percent identity to appropriately high/low values, they still don't report all possible alignments. There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

Any recommendations for how to get this done quickly? I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

Thanks in advance.

ADD COMMENTlink modified 7 days ago by fishgolden390 • written 8 days ago by bioguy30
2
gravatar for fishgolden
7 days ago by
fishgolden390
fishgolden390 wrote:

There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

I think this is really difficult task since the reason why those programs (at least BLAST) are fast is that they have some filtering steps.

In my opinion, one of the critical filters still remaining in your protocol with BLAST is that the "word size" & I don't know about DIAMOND very much, sorry.

Anyway, I recommend water or needle in EMBOSS package http://emboss.sourceforge.net/ . Or SSEARCH, or GGSEARCH in FASTA package https://github.com/wrpearson/fasta36 for your purpose.

ADD COMMENTlink written 7 days ago by fishgolden390
0
gravatar for jrj.healey
7 days ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

Neither DIAMOND nor BLAST are the tool for this really.

Just iterate over your file in an exhaustive pairwise approach, and use a normal aligner like MAFFT or CLUSTAL which can give you a score.

I'd advise looking at ways to parallelize this as you have an n choose k == 499500 problem here.

ADD COMMENTlink written 7 days ago by jrj.healey12k
1

Thank you – ended up going the CLUSTAL route with some basic multi-threading to speed things up. Very much appreciate your help!

ADD REPLYlink written 7 days ago by bioguy30

No problem ;)

ADD REPLYlink written 7 days ago by jrj.healey12k

CLUSTAL and MAFFT are Multiple Sequence Alignment software that they perform stringent DP (edit: jrj.healey said CLUSTAL's alignment is a bit janky Its alignment may not be stringent one...) same as water and needle when they are used as pairwise aligner. I mean, I think CLUSTAL and MAFFT are over spec. But it is ok because the results would be the same when the same parameters were used, i guess.

ADD REPLYlink modified 7 days ago • written 7 days ago by fishgolden390
1

Clustal begins by performing all vs all pairwise alignments anyway, but other threads on the site have shown that its pairwise alignments can be a bit janky. Using needle or water directly is likely a good approach too. In my experience, MAFFT gives better DNA alignments than does CLUSTAL anyway.

(Those were just the first 2 to come to mind, there are many other aligners to experiment with :) )

ADD REPLYlink written 7 days ago by jrj.healey12k

but other threads on the site have shown that its pairwise alignments can be a bit janky.

Oh, really. I didn't know that. Thank you for the comment.

ADD REPLYlink modified 7 days ago • written 7 days ago by fishgolden390

I'm mainly referring to this one: what is the problem with using clustal to do pairwise alignment?

But its far from a systematic evaluation, so YMMV.

ADD REPLYlink written 6 days ago by jrj.healey12k

In Clustal Omega, the alignments are then computed using the very accurate HHalign package (Söding, 2005), which aligns two profile hidden Markov models (Eddy, 1998).

http://msb.embopress.org/content/7/1/539

It seems that Clustal Omega uses HHalign.

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

ClustalW is using DP by default.

http://www.clustal.org/download/clustalw_help.txt

ADD REPLYlink modified 6 days ago • written 6 days ago by fishgolden390

Ah ok, thank you. Will give the others you recommended a try as well if need be. Really appreciate all the help!

ADD REPLYlink written 7 days ago by bioguy30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2268 users visited in the last hour