Question: BLAST all against all
0
gravatar for hala.elbarbary
4 days ago by
hala.elbarbary0 wrote:

I am trying to do BLAST all against all for a fasta file of 324 protein sequence using this command:

blastp -query myseq.fasta -subject myseq.fasta -outfmt 6 >all_against_all.tsv

should the number of alignments in the output file = 324*324?

I get a different number of alignments on different machines, yet lower than 324*324
why do they differ using the same fasta file and the same command?

ADD COMMENTlink modified 3 days ago by jrj.healey8.3k • written 4 days ago by hala.elbarbary0

few initial remarks:

your (formatted) blastDB is also called myseq.fasta?

No, why would you think so? the only certainty is that is should be at least 324 alignments (ignoring the rare exceptions where it can be lower)

that's interesting. The exact same command , with exactly the same input & DB. that should not happen. Can you post a bit more info about that (what are the difference machines, how different are the blast outputs?)

ADD REPLYlink written 4 days ago by lieven.sterck3.0k

I think that's valid, I vaguely recall doing something like that in the past myself.

Lieven's comments about the number of alignments is right though. You don't just get 1v1 alignments from blast as it's a local aligner. It will give you all local substring alignments that meet your scoring criteria.

If you only want 1v1 all-vs-all comparisons, what you're really looking for is "pairwise global alignment".

ADD REPLYlink written 4 days ago by jrj.healey8.3k

yes, that is valid, but was just trying to get confirmation that OP did not just added twice the fasta file

ADD REPLYlink written 3 days ago by lieven.sterck3.0k

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLYlink written 3 days ago by RamRS19k

I think you have a prob with your -subject myseq.fasta ( i hope this file is refefence). Try to use blastDB and create database from all reference protein seq files. you can create single alias file for your database. then repeat your command with -db parameter instead of -subject. Thanks

ADD REPLYlink modified 3 days ago • written 3 days ago by k.kathirvel93130

This point has already been covered by lieven and jrj.healey above.

ADD REPLYlink written 3 days ago by RamRS19k

There is nothing wrong with the command itself - blast doesnt need to use a database necessarily (I've just tested it). OP is, however, mistaken in the expectation that you'd get n^2 alignments back from an all-vs-all blast of n sequences.

ADD REPLYlink modified 3 days ago • written 3 days ago by jrj.healey8.3k
3
gravatar for jrj.healey
3 days ago by
jrj.healey8.3k
United Kingdom
jrj.healey8.3k wrote:

To clarify from the comments, here's an example to prove the command works and the misconception about numbers of alignments.

I have the following input file (I've omitted the sequences for length):

>PAK_01787 PAK_01787 T4-like virus tail tube protein gp19 1997281:1997730 forward
>PAK_01896 PAK_01896 T4-like virus tail tube protein gp19 2128407:2128844 reverse
>PAK_02014 PAK_02014 T4-like virus tail tube protein gp19 2270593:2271042 reverse
>PAK_02606 PAK_02606 T4-like virus tail tube protein gp19 2905301:2905750 forward
>PAK_03203 PAK_03203 T4-like virus tail tube protein gp19 3608015:3608464 reverse
>PAU_01961 PAU_01961 T4-like virus tail tube protein gp19 2233799:2234248 forward
>PAU_02074 PAU_02074 T4-like virus tail tube protein gp19 2377791:2378228 reverse
>PAU_02206 PAU_02206 T4-like virus tail tube protein gp19 2539274:2539723 reverse
>PAU_02775 PAU_02775 T4-like virus tail tube protein gp19 3218747:3219196 forward
>PAU_03392 PAU_03392 T4-like virus tail tube protein gp19 3933715:3934164 reverse
>PLT_01696 PLT_01696 T4-like virus tail tube protein gp19 1991047:1991496 reverse
>PLT_01716 PLT_01716 T4-like virus tail tube protein gp19 2015263:2015712 reverse
>PLT_01736 PLT_01736 T4-like virus tail tube protein gp19 2039178:2039627 reverse
>PLT_01758 PLT_01758 T4-like virus tail tube protein gp19 2064746:2065195 reverse
>PLT_02424 PLT_02424 T4-like virus tail tube protein gp19 2800483:2800932 forward
>PLT_02568 PLT_02568 T4-like virus tail tube protein gp19 2980404:2980853 reverse

These proteins are all very similar to one another, but not the same. There are 16 total sequences.

The command:

blastp -query PVC1_homologs.fsa -subject PVC1_homologs.fsa -outfmt 6 | tee -a allvsallblast.tsv

Gives 20 results, as some sequences match more than just their selves (you'd get more results as you lower the result stringency, or have more sequences with increasing numbers of subsequence matches):

PAK_01787   PAK_01787   100.000 450 0   0   1   450 1   450 0.0 554
PAK_01787   PAU_01961   96.889  450 14  0   1   450 1   450 0.0 518
PAK_01896   PAK_01896   100.000 438 0   0   1   438 1   438 0.0 537
PAK_02014   PAK_02014   100.000 450 0   0   1   450 1   450 0.0 546
PAK_02014   PAU_02206   99.333  450 3   0   1   450 1   450 0.0 541
PAK_02606   PAK_02606   100.000 450 0   0   1   450 1   450 0.0 553
PAK_03203   PAK_03203   100.000 450 0   0   1   450 1   450 0.0 552
PAU_01961   PAU_01961   100.000 450 0   0   1   450 1   450 0.0 552
PAU_01961   PAK_01787   96.889  450 14  0   1   450 1   450 0.0 518
PAU_02074   PAU_02074   100.000 438 0   0   1   438 1   438 0.0 538
PAU_02206   PAU_02206   100.000 450 0   0   1   450 1   450 0.0 546
PAU_02206   PAK_02014   99.333  450 3   0   1   450 1   450 0.0 541
PAU_02775   PAU_02775   100.000 450 0   0   1   450 1   450 0.0 554
PAU_03392   PAU_03392   100.000 450 0   0   1   450 1   450 0.0 551
PLT_01696   PLT_01696   100.000 450 0   0   1   450 1   450 0.0 555
PLT_01716   PLT_01716   100.000 450 0   0   1   450 1   450 0.0 554
PLT_01736   PLT_01736   100.000 450 0   0   1   450 1   450 0.0 554
PLT_01758   PLT_01758   100.000 450 0   0   1   450 1   450 0.0 554
PLT_02424   PLT_02424   100.000 450 0   0   1   450 1   450 0.0 554
PLT_02568   PLT_02568   100.000 450 0   0   1   450 1   450 0.0 551

Note the cases where the % ID is not 100, there is therefore more than 1 match per sequence in an all vs all blast. If I was using a better example dataset, you'd see substring matches within the same hits too.

ADD COMMENTlink written 3 days ago by jrj.healey8.3k

nice, one more thing I learned about blast (== doing "bl2seq" on multifasta files)

ADD REPLYlink written 3 days ago by lieven.sterck3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1636 users visited in the last hour