Question: How can I BLAST each sequence in a FASTA-file against all the other sequences in the same file?
0
gravatar for driliwyr
5.8 years ago by
driliwyr10
Germany
driliwyr10 wrote:

I'm looking for a way to BLAST each sequence in a file, protein sequences in FASTA format, against all the other sequences in the same file. I'm only interested in the best HSP per sequence-sequence pair. So, for 10 sequences I'd get 10*10=100 outputs. Any easy and efficient way to do this? I'm using BLAST (blastp) 2.2.29+.

Update 1:

Say I have 3 sequences in sequences.fasta:

>sp|P0C9J3|1101B_ASFWA Protein MGF 110-11L OS=African swine fever virus (isolate Warthog/Namibia/Wart80/1980) GN=War-022 PE=3 SV=1
MKVLLGLLLGYSVLILAHELPDLPRTQHPPKSELSYWCTYVPQCDFCWDCQDGICKNKIT
ESRFIDSNHSIVNCRVFRDSKTQSCLYEISSKMPNHFNMECLHPRPYTGNEIFMRTWGGG
DHQQLSIKQFCLYFIIGIAYTGCFVCALCKNLRLRTTMKLFILLSILVWLAQPVLNRPLS
IFYTKQILPRTYTPPMRELEYWCTYGKHCDFCWDCKNGICKNKVLDDMPLIVQNDYISKC
SITRFIDRCMYFIEPKIPYIHYMNCSLPTYFS
>sp|O55739|127L_IIV6 Uncharacterized protein 127L OS=Invertebrate iridescent virus 6 GN=IIV6-127L PE=4 SV=1
MTDYKHKINAEGEYLPFYGYTSISMLSSNTSSVQVLQNIEQLISQTIIGKYYSPLPHTTY
HMTLFNIYCMASSPIPPVQRWTLENEENKIPEKLWLSEDVLKIKHIKALDCLRKLPSHLK
ITNLEFYYKKGLGVWVTLDEESFNAVTKLRKEFSVIYEHDDANLKLHITFAYLFKELPFK
NGTREEKKALKTELLKLVKMLNTLKLWTLTNHNIYLYNSMTNYFPIDFFFSSSLV
>sp|P32234|128UP_DROME GTP-binding protein 128up OS=Drosophila melanogaster GN=128up PE=2 SV=2
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVA
KTGDARVGFVGFPSVGKSTLLSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPG
IIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKLLEHELEGFGIRLNKKPPNIY
YKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKG
QLPDYNSPVVLHNERTSIEDFCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDED
VVQIVKKV

I created a database with 

user% makeblastdb -in sequences.fasta -title sequences -dbtype prot -out sequences -parse_seqids

Running each sequence as a query against the other, separately:

user% blastp -query P0C9J3.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9
user% blastp -query P32234.fasta -db sequences -outfmt 6 -evalue 100
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0
user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2

O55739 but with -max_hsps 1 added (only getting one result per query-target pair, wanted behaviour):

user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2

All sequences against all:

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0

All sequences against all and adding -max_hsps 1 flag:

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756

I would like to have the result as presented when running each sequence separately against the sequences-database, but all presented together by just running one command, i.e. 3 hits seperately but an output-list of all 9. If I run sequences.fasta as query against the database and with -max_hsps 1 flag, I only get the top hit for each query sequence. Maybe I'm misunderstanding the -max_hsps flag? I could solve this by scripting and run a seperate query for each sequence, but that would cause a loss of the multicore ability?

blast • 6.0k views
ADD COMMENTlink modified 5.8 years ago by Renesh1.9k • written 5.8 years ago by driliwyr10
1
gravatar for thiagomafra
5.8 years ago by
thiagomafra70
Brazil
thiagomafra70 wrote:

I don't know if I understand well your question, but... if you to duplicate the fasta file and to use a as -query and another as -db? You also can to use -outfmt 6 (tabular format), -best_hit_score_edge 0.1.

ADD COMMENTlink written 5.8 years ago by thiagomafra70

I updated the original question with some more information. Hope that it is the correct procedure. Thanks for helping.

ADD REPLYlink written 5.8 years ago by driliwyr10
0
gravatar for 5heikki
5.8 years ago by
5heikki9.2k
Finland
5heikki9.2k wrote:
-query yourFile -subject yourFile

Or if you want multithreaded then you create a db from your file and blast against that. As thiagomafra writes, you should probably use outfmt -6 for e.g. easy removal of self hits and sorting of best hits..

ADD COMMENTlink written 5.8 years ago by 5heikki9.2k

I updated the original question with some more information. Hope that it is the correct procedure. Thanks for helping.

ADD REPLYlink written 5.8 years ago by driliwyr10
0
gravatar for Renesh
5.8 years ago by
Renesh1.9k
United States
Renesh1.9k wrote:

Run the blastp commnad with option  -max_target_seqs 2. This will give you 2 hits per query. When you do self blast, obviously the first hit will be query itself. Then delete the first hit and consider the second hit is the best match for your query.

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Renesh1.9k

But I want 3 hits per query in this case. Running with -max_target_seqs 3 gives results:

user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_target_seqs 3
sp|P0C9J3|1101B_ASFWA   sp|P0C9J3|1101B_ASFWA   100.00  272     0       0       1       272     1       272     0.0       564
sp|P0C9J3|1101B_ASFWA   sp|P32234|128UP_DROME   50.00   12      6       0       188     199     291     302     3.7     15.8
sp|P0C9J3|1101B_ASFWA   sp|O55739|127L_IIV6     33.33   12      8       0       110     121     88      99         54   11.9
sp|O55739|127L_IIV6     sp|O55739|127L_IIV6     100.00  235     0       0       1       235     1       235     1e-177    481
sp|O55739|127L_IIV6     sp|P32234|128UP_DROME   35.29   17      11      0       106     122     255     271     5.0     15.0
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   33.33   12      8       0       88      99      110     121        47   11.9
sp|O55739|127L_IIV6     sp|P0C9J3|1101B_ASFWA   57.14   7       3       0       120     126     58      64         91   11.2
sp|P32234|128UP_DROME   sp|P32234|128UP_DROME   100.00  368     0       0       1       368     1       368     0.0       756
sp|P32234|128UP_DROME   sp|P0C9J3|1101B_ASFWA   50.00   12      6       0       291     302     188     199     4.5     15.8
sp|P32234|128UP_DROME   sp|O55739|127L_IIV6     35.29   17      11      0       255     271     106     122     7.7     15.0

Which gives two alignments for O55739 against P0C9J3. I only want one.

I take the opportunity to ask: what is the difference between max_target_seqs, max_hsps, num_descriptions and num_alignments?

Thanks for taking the time to help!

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by driliwyr10

max_target_seqs is the maximum number of hits for the query sequence against the target database. This is incompatible with num_alignments. This means when you provide output in tabular format (-outfmt = 6), you should always use max_target_seqs. The num_alignments works with (-outfmt = 0).

Similarly,  num_descriptions does not works with outfmt > 4. This option will show one line descriptions for target sequences which has hit to query sequence.

max_hsps indicates the number of HSPs per hits. One quey may have multiple hits in same target sequences. You can control reporting of HSPs per query using this option (here understand hit and hsp concept).

I would recommend you to use max_target_seqs = 2, which will give you 2 hits per query. And then delete first hits (you can write your own code for this) and second will be the high scoring hits to your query.

ADD REPLYlink written 5.8 years ago by Renesh1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1662 users visited in the last hour
_