Blastall gives several hits to the same subject sequence
3
0
Entering edit mode
7.8 years ago

Short query sequences will sometimes give several hits to the same (often large) subject sequence. This is problematic if you ask for the 20 best hits since blastall will actually give you the best hits to 20 subject sequences. Thus using the following query:

IVLFGGSAVNGLLADTWQFDGTRWQQVAAVVANTAPNTAPNHALAYDARRELMVLYGGFDPLAPTGARSDTWEFDGHTWTARTDVSDANTRYNHAMTFDAVAGRVLLVGGHANGQMPLADYFQYDGVTWTRLLLAEPPPAGG


against refseq will yield 84 hits using the following command:

blastall -d complete.nonredundant_protein.faa -i test.faa -m 8 -b 20 -p blastp -o test.output


As stated, test.output contains 84 hits but only to 20 subject sequences. How do I get blastall to only give one hit per subject?

Furthermore blastp of the blast+ package has the same problem. Note that -num_descriptions is not possible with tabulated output and -max_target_seqs is used instead.

blastp -db complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -outfmt 6 -query test.faa

blast • 9.5k views
9
Entering edit mode
7.8 years ago
hpmcwill ★ 1.2k

The additional alignments reported for each hit are high-scoring segment pairs (HSPs). In NCBI BLAST+ the number of HSPs reported per hit (i.e. sequence in the database) can be limited using the -max_hsps option (NCBI BLAST 2.2.29+):

-max_hsps <Integer, >=0>

Set maximum number of HSPs per subject sequence to save (0 means no limit)
Default = 0'

Legacy NCBI BLAST (blastall) does not allow the number of HSPs per hit to be controlled, so post processing is required to limit the number of HSPs appearing in 'blastall' output.

For example:

Using your example query sequence to search UniProtKB/SwissProt with NCBI BLAST+ find some hits that have more than one HSP and thus appear more than once in the output:

seq1    SP:SPE26_CAEEL    29.55    132    80    6    1    127    293    416    0.004    39.7
seq1    SP:MKLN1_RAT    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_PONAB    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_MOUSE    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_HUMAN    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MDE6_SCHPO    26.40    125    72    7    1    110    277    396    0.48    33.5
seq1    SP:MDE6_SCHPO    25.00    72    49    2    42    111    266    334    2.6    31.6
seq1    SP:TYW4_YEAST    27.35    117    78    4    1    114    400    512    0.56    33.5
seq1    SP:GAN_HUMAN    26.73    101    63    6    35    131    310    403    2.0    32.0
seq1    SP:CC135_CHLRE    35.00    40    26    0    91    130    350    389    2.2    32.0
seq1    SP:CC135_CHLRE    28.46    123    71    6    1    114    310    424    6.9    30.4
seq1    SP:TAG53_CAEEL    24.81    129    82    8    15    132    377    501    4.3    30.8
seq1    SP:UTP12_YEAST    22.56    133    64    5    17    120    336    458    5.7    30.4
seq1    SP:SAHH_BACFR    24.76    105    70    4    9    113    184    279    6.1    30.4
seq1    SP:METN_BIFLO    35.48    62    31    1    51    103    307    368    7.9    30.0
seq1    SP:ATRN_MOUSE    25.00    96    56    5    42    125    502    593    9.4    30.0
seq1    SP:ANMK_PELUB    32.43    37    25    0    97    133    15    51    9.6    29.6


Adding -max_hsps 1 to the command-line limits each hit to reporting only one HSP, which eliminates the multiple reporting:

seq1    SP:SPE26_CAEEL    29.55    132    80    6    1    127    293    416    0.004    39.7
seq1    SP:MKLN1_RAT    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_PONAB    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_MOUSE    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_HUMAN    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MDE6_SCHPO    26.40    125    72    7    1    110    277    396    0.48    33.5
seq1    SP:TYW4_YEAST    27.35    117    78    4    1    114    400    512    0.56    33.5
seq1    SP:GAN_HUMAN    26.73    101    63    6    35    131    310    403    2.0    32.0
seq1    SP:TAG53_CAEEL    24.81    129    82    8    15    132    377    501    4.3    30.8
seq1    SP:UTP12_YEAST    22.56    133    64    5    17    120    336    458    5.7    30.4
seq1    SP:SAHH_BACFR    24.76    105    70    4    9    113    184    279    6.1    30.4
seq1    SP:CC135_CHLRE    28.46    123    71    6    1    114    310    424    6.9    30.4
seq1    SP:METN_BIFLO    35.48    62    31    1    51    103    307    368    7.9    30.0
seq1    SP:ATRN_MOUSE    25.00    96    56    5    42    125    502    593    9.4    30.0
seq1    SP:ANMK_PELUB    32.43    37    25    0    97    133    15    51    9.6    29.6


Update: it appears that -max_hsps replaced the -max_hsps_per_subject option in NCBI BLAST 2.2.29+ (Jan 2014). In earlier NCBI BLAST+ versions the -max_hsps_per_subject should have the same effect, but appears to be bugged. This change is not mentioned in the BLAST+ Release Notes for some reason, so I am guessing it was mixed in with some other fix/feature. For earlier versions it appears that the options are to try using other parameters to get the desired result (see Obtaining the top matches from blast) or using post processing steps to filter out the additional alignments.

0
Entering edit mode

First of all, thanks for taking the time...the blast is obviously strong with you :-)

Using Package: blast 2.2.27, build Mar 23 2013 16:48:04 I tried

blastp -db ~/genomics/NCBI/Refseq/Prot/complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -max_hsps 1 -outfmt 6 -query test.faa
Error: Unknown argument: "max_hsps"


I then tried

blastp -db ~/genomics/NCBI/Refseq/Prot/complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -max_hsps_per_subject 1 -outfmt 6 -query test.faa


but alas even though I specify -max_target_seqs 20 -max_hsps_per_subject 1 I still get 95 hits?!?

1
Entering edit mode

Looks like the -max_hsps_per_subject option was bugged... it was replaced with -max_hsps in NCBI BLAST 2.2.29+, so you might want to upgrade to the current version.

0
Entering edit mode

Will talk to sysadm and report back, thanks again!

1
Entering edit mode
7.8 years ago

The only solution I have found is to subsequently use sort on the output file. Since the list is already sorted by query, subject and then e-val, you are able to do:

cat test.output | sort -u -k1,2


Update:

If able, use blast+ version => 2.2.29 and see accepted answer else this still seems to be the only solution if you are using blast or blast+ version <2.2.29

0
Entering edit mode
19 months ago
harelarik ▴ 90

This could help to get only one subject per query:

set -max_target_seqs to 1 [you must set outfmt to any option >4]

Warning- you may get self hits.

If your query sequence is present in the database you search in, you will probably get self hits if you ask only for 1 hit.

If you still want only one hit, and the command above does not work you may consider to use: -num_descriptions and -num_alignments` to 1