How can I BLAST each sequence in a FASTA-file against all the other sequences in the same file?
3
0
Entering edit mode
7.5 years ago
driliwyr ▴ 10

I'm looking for a way to BLAST each sequence in a file, protein sequences in FASTA format, against all the other sequences in the same file. I'm only interested in the best HSP per sequence-sequence pair. So, for 10 sequences I'd get 10*10=100 outputs. Any easy and efficient way to do this? I'm using BLAST (blastp) 2.2.29+.

Update 1

Say I have 3 sequences in sequences.fasta:

>sp|P0C9J3|1101B_ASFWA Protein MGF 110-11L OS=African swine fever virus (isolate Warthog/Namibia/Wart80/1980) GN=War-022 PE=3 SV=1
MKVLLGLLLGYSVLILAHELPDLPRTQHPPKSELSYWCTYVPQCDFCWDCQDGICKNKIT
ESRFIDSNHSIVNCRVFRDSKTQSCLYEISSKMPNHFNMECLHPRPYTGNEIFMRTWGGG
DHQQLSIKQFCLYFIIGIAYTGCFVCALCKNLRLRTTMKLFILLSILVWLAQPVLNRPLS
IFYTKQILPRTYTPPMRELEYWCTYGKHCDFCWDCKNGICKNKVLDDMPLIVQNDYISKC
SITRFIDRCMYFIEPKIPYIHYMNCSLPTYFS
>sp|O55739|127L_IIV6 Uncharacterized protein 127L OS=Invertebrate iridescent virus 6 GN=IIV6-127L PE=4 SV=1
MTDYKHKINAEGEYLPFYGYTSISMLSSNTSSVQVLQNIEQLISQTIIGKYYSPLPHTTY
HMTLFNIYCMASSPIPPVQRWTLENEENKIPEKLWLSEDVLKIKHIKALDCLRKLPSHLK
ITNLEFYYKKGLGVWVTLDEESFNAVTKLRKEFSVIYEHDDANLKLHITFAYLFKELPFK
NGTREEKKALKTELLKLVKMLNTLKLWTLTNHNIYLYNSMTNYFPIDFFFSSSLV
>sp|P32234|128UP_DROME GTP-binding protein 128up OS=Drosophila melanogaster GN=128up PE=2 SV=2
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVA
KTGDARVGFVGFPSVGKSTLLSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPG
IIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKLLEHELEGFGIRLNKKPPNIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKG
QLPDYNSPVVLHNERTSIEDFCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDED
VVQIVKKV


I created a database with

user% makeblastdb -in sequences.fasta -title sequences -dbtype prot -out sequences -parse_seqids


Running each sequence as a query against the other, separately

user% blastp -query P0C9J3.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9

user% blastp -query P32234.fasta -db sequences -outfmt 6 -evalue 100
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0

user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2


O55739 but with -max_hsps 1 added (only getting one result per query-target pair, wanted behaviour)

user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2


All sequences against all

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0


All sequences against all and adding -max_hsps 1 flag

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756


I would like to have the result as presented when running each sequence separately against the sequences-database, but all presented together by just running one command, i.e. 3 hits separately but an output-list of all 9. If I run sequences.fasta as query against the database and with -max_hsps 1 flag, I only get the top hit for each query sequence. Maybe I'm misunderstanding the -max_hsps flag? I could solve this by scripting and run a separate query for each sequence, but that would cause a loss of the multi-core ability?

blast • 7.0k views
1
Entering edit mode
7.5 years ago
thiagomafra ▴ 70

I don't know if I understand well your question, but... if you to duplicate the fasta file and to use a as -query and another as -db? You also can to use -outfmt 6 (tabular format), -best_hit_score_edge 0.1.

0
Entering edit mode

I updated the original question with some more information. Hope that it is the correct procedure. Thanks for helping.

0
Entering edit mode
7.5 years ago
5heikki 10k
-query yourFile -subject yourFile


Or if you want multithreaded then you create a db from your file and blast against that. As thiagomafra writes, you should probably use outfmt -6 for e.g. easy removal of self hits and sorting of best hits..

0
Entering edit mode

I updated the original question with some more information. Hope that it is the correct procedure. Thanks for helping.

0
Entering edit mode
7.5 years ago
Renesh ★ 2.1k

Run the blastp commnad with option -max_target_seqs 2. This will give you 2 hits per query. When you do self blast, obviously the first hit will be query itself. Then delete the first hit and consider the second hit is the best match for your query.

0
Entering edit mode

But I want 3 hits per query in this case. Running with -max_target_seqs 3 gives results:

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_target_seqs 3
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0


Which gives two alignments for O55739 against P0C9J3. I only want one.

I take the opportunity to ask: what is the difference between max_target_seqs, max_hsps, num_descriptions and num_alignments?

Thanks for taking the time to help!

0
Entering edit mode

max_target_seqs is the maximum number of hits for the query sequence against the target database. This is incompatible with num_alignments. This means when you provide output in tabular format (-outfmt = 6), you should always use max_target_seqs. The num_alignments works with (-outfmt = 0).

Similarly, num_descriptions does not works with outfmt > 4. This option will show one line descriptions for target sequences which has hit to query sequence.

max_hsps indicates the number of HSPs per hits. One query may have multiple hits in same target sequences. You can control reporting of HSPs per query using this option (here understand hit and hsp concept).

I would recommend you to use max_target_seqs = 2, which will give you 2 hits per query. And then delete first hits (you can write your own code for this) and second will be the high scoring hits to your query.