I'm looking for a way to BLAST each sequence in a file, protein sequences in FASTA format, against all the other sequences in the same file. I'm only interested in the best HSP per sequence-sequence pair. So, for 10 sequences I'd get 10*10=100 outputs. Any easy and efficient way to do this? I'm using BLAST (blastp) 2.2.29+.
Update 1:
Say I have 3 sequences in sequences.fasta:
>sp|P0C9J3|1101B_ASFWA Protein MGF 110-11L OS=African swine fever virus (isolate Warthog/Namibia/Wart80/1980) GN=War-022 PE=3 SV=1 MKVLLGLLLGYSVLILAHELPDLPRTQHPPKSELSYWCTYVPQCDFCWDCQDGICKNKIT ESRFIDSNHSIVNCRVFRDSKTQSCLYEISSKMPNHFNMECLHPRPYTGNEIFMRTWGGG DHQQLSIKQFCLYFIIGIAYTGCFVCALCKNLRLRTTMKLFILLSILVWLAQPVLNRPLS IFYTKQILPRTYTPPMRELEYWCTYGKHCDFCWDCKNGICKNKVLDDMPLIVQNDYISKC SITRFIDRCMYFIEPKIPYIHYMNCSLPTYFS >sp|O55739|127L_IIV6 Uncharacterized protein 127L OS=Invertebrate iridescent virus 6 GN=IIV6-127L PE=4 SV=1 MTDYKHKINAEGEYLPFYGYTSISMLSSNTSSVQVLQNIEQLISQTIIGKYYSPLPHTTY HMTLFNIYCMASSPIPPVQRWTLENEENKIPEKLWLSEDVLKIKHIKALDCLRKLPSHLK ITNLEFYYKKGLGVWVTLDEESFNAVTKLRKEFSVIYEHDDANLKLHITFAYLFKELPFK NGTREEKKALKTELLKLVKMLNTLKLWTLTNHNIYLYNSMTNYFPIDFFFSSSLV >sp|P32234|128UP_DROME GTP-binding protein 128up OS=Drosophila melanogaster GN=128up PE=2 SV=2 MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVA KTGDARVGFVGFPSVGKSTLLSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPG IIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKLLEHELEGFGIRLNKKPPNIY YKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKG QLPDYNSPVVLHNERTSIEDFCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDED VVQIVKKV
I created a database with
user% makeblastdb -in sequences.fasta -title sequences -dbtype prot -out sequences -parse_seqids
Running each sequence as a query against the other, separately:
user% blastp -query P0C9J3.fasta -db sequences -outfmt 6 -evalue 100 sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564 sp|P0C9J3|1101B_ASFWA sp|P32234|128UP_DROME 50.00 12 6 0 188 199 291 302 3.7 15.8 sp|P0C9J3|1101B_ASFWA sp|O55739|127L_IIV6 33.33 12 8 0 110 121 88 99 54 11.9
user% blastp -query P32234.fasta -db sequences -outfmt 6 -evalue 100 sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756 sp|P32234|128UP_DROME sp|P0C9J3|1101B_ASFWA 50.00 12 6 0 291 302 188 199 4.5 15.8 sp|P32234|128UP_DROME sp|O55739|127L_IIV6 35.29 17 11 0 255 271 106 122 7.7 15.0
user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100 sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481 sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0 sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 33.33 12 8 0 88 99 110 121 47 11.9 sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2
O55739 but with -max_hsps 1 added (only getting one result per query-target pair, wanted behaviour):
user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1 sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481 sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0 sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2
All sequences against all:
user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564 sp|P0C9J3|1101B_ASFWA sp|P32234|128UP_DROME 50.00 12 6 0 188 199 291 302 3.7 15.8 sp|P0C9J3|1101B_ASFWA sp|O55739|127L_IIV6 33.33 12 8 0 110 121 88 99 54 11.9 sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481 sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0 sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 33.33 12 8 0 88 99 110 121 47 11.9 sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2 sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756 sp|P32234|128UP_DROME sp|P0C9J3|1101B_ASFWA 50.00 12 6 0 291 302 188 199 4.5 15.8 sp|P32234|128UP_DROME sp|O55739|127L_IIV6 35.29 17 11 0 255 271 106 122 7.7 15.0
All sequences against all and adding -max_hsps 1 flag:
user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1 sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564 sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481 sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756
I would like to have the result as presented when running each sequence separately against the sequences-database, but all presented together by just running one command, i.e. 3 hits seperately but an output-list of all 9. If I run sequences.fasta as query against the database and with -max_hsps 1 flag, I only get the top hit for each query sequence. Maybe I'm misunderstanding the -max_hsps flag? I could solve this by scripting and run a seperate query for each sequence, but that would cause a loss of the multicore ability?