Question: Blastall gives several hits to the same subject sequence
0
gravatar for gislevestergaard
4.5 years ago by
Germany
gislevestergaard0 wrote:

Short query sequences will sometimes give several hits to the same (often large) subject sequence. This is problematic if you ask for the 20 best hits since blastall will actually give you the best hits to 20 subject sequences. Thus using the following query: IVLFGGSAVNGLLADTWQFDGTRWQQVAAVVANTAPNTAPNHALAYDARRELMVLYGGFDPLAPTGARSDTWEFDGHTWTARTDVSDANTRYNHAMTFDAVAGRVLLVGGHANGQMPLADYFQYDGVTWTRLLLAEPPPAGG against refseq will yield 84 hits using the following command:

blastall -d complete.nonredundant_protein.faa -i test.faa -m 8 -b 20 -p blastp -o test.output

As stated, test.output contains 84 hits but only to 20 subject sequences. How do I get blastall to only give one hit per subject?

Furthermore blastp of the blast+ package has the same problem. Note that -num_descriptions is not possible with tabulated output and -max_target_seqs is used instead.

blastp -db complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -outfmt 6 -query test.faa

 

blast • 6.1k views
ADD COMMENTlink modified 4.5 years ago by hpmcwill1.1k • written 4.5 years ago by gislevestergaard0
8
gravatar for hpmcwill
4.5 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

The additional alignments reported for each hit are high-scoring segment pairs (HSPs). In NCBI BLAST+ the number of HSPs reported per hit (i.e. sequence in the database) can be limited using the '-max_hsps' option (NCBI BLAST 2.2.29+):

 -max_hsps <Integer, >=0>

   Set maximum number of HSPs per subject sequence to save (0 means no limit)
   Default = `0'

Legacy NCBI BLAST (blastall) does not allow the number of HSPs per hit to be controlled, so post processing is required to limit the number of HSPs appearing in 'blastall' output.

For example:

Using your example query sequence to search UniProtKB/SwissProt with NCBI BLAST+ find some hits that have more than one HSP and thus appear more than once in the output:

seq1    SP:SPE26_CAEEL    29.55    132    80    6    1    127    293    416    0.004    39.7
seq1    SP:MKLN1_RAT    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_PONAB    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_MOUSE    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_HUMAN    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MDE6_SCHPO    26.40    125    72    7    1    110    277    396    0.48    33.5
seq1    SP:MDE6_SCHPO    25.00    72    49    2    42    111    266    334    2.6    31.6
seq1    SP:TYW4_YEAST    27.35    117    78    4    1    114    400    512    0.56    33.5
seq1    SP:GAN_HUMAN    26.73    101    63    6    35    131    310    403    2.0    32.0
seq1    SP:CC135_CHLRE    35.00    40    26    0    91    130    350    389    2.2    32.0
seq1    SP:CC135_CHLRE    28.46    123    71    6    1    114    310    424    6.9    30.4
seq1    SP:TAG53_CAEEL    24.81    129    82    8    15    132    377    501    4.3    30.8
seq1    SP:UTP12_YEAST    22.56    133    64    5    17    120    336    458    5.7    30.4
seq1    SP:SAHH_BACFR    24.76    105    70    4    9    113    184    279    6.1    30.4
seq1    SP:METN_BIFLO    35.48    62    31    1    51    103    307    368    7.9    30.0
seq1    SP:ATRN_MOUSE    25.00    96    56    5    42    125    502    593    9.4    30.0
seq1    SP:ANMK_PELUB    32.43    37    25    0    97    133    15    51    9.6    29.6

Adding '-max_hsps 1' to the command-line limits each hit to reporting only one HSP, which eliminates the multiple reporting:

seq1    SP:SPE26_CAEEL    29.55    132    80    6    1    127    293    416    0.004    39.7
seq1    SP:MKLN1_RAT    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_PONAB    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_MOUSE    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MKLN1_HUMAN    25.00    136    83    6    23    142    250    382    0.061    36.2
seq1    SP:MDE6_SCHPO    26.40    125    72    7    1    110    277    396    0.48    33.5
seq1    SP:TYW4_YEAST    27.35    117    78    4    1    114    400    512    0.56    33.5
seq1    SP:GAN_HUMAN    26.73    101    63    6    35    131    310    403    2.0    32.0
seq1    SP:TAG53_CAEEL    24.81    129    82    8    15    132    377    501    4.3    30.8
seq1    SP:UTP12_YEAST    22.56    133    64    5    17    120    336    458    5.7    30.4
seq1    SP:SAHH_BACFR    24.76    105    70    4    9    113    184    279    6.1    30.4
seq1    SP:CC135_CHLRE    28.46    123    71    6    1    114    310    424    6.9    30.4
seq1    SP:METN_BIFLO    35.48    62    31    1    51    103    307    368    7.9    30.0
seq1    SP:ATRN_MOUSE    25.00    96    56    5    42    125    502    593    9.4    30.0
seq1    SP:ANMK_PELUB    32.43    37    25    0    97    133    15    51    9.6    29.6

Update: it appears that '-max_hsps' replaced the '-max_hsps_per_subject' option in NCBI BLAST 2.2.29+ (Jan 2014). In earlier NCBI BLAST+ versions the '-max_hsps_per_subject' should have the same effect, but appears to be bugged. This change is not mentioned in the "BLAST+ Release Notes" for some reason, so I am guessing it was mixed in with some other fix/feature. For earlier versions it appears that the options are to try using other parameters to get the desired result (see "Obtaining the top matches from blast" for some possibilities) or using post processing steps to filter out the additional alignments.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by hpmcwill1.1k

First of all, thanks for taking the time...the blast is obviously strong with you :-)

Using Package: blast 2.2.27, build Mar 23 2013 16:48:04 I tried

blastp -db ~/genomics/NCBI/Refseq/Prot/complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -max_hsps 1 -outfmt 6 -query test.faa

Error: Unknown argument: "max_hsps"

I then tried

blastp -db ~/genomics/NCBI/Refseq/Prot/complete.nonredundant_protein.faa -out test.blastplus -max_target_seqs 20 -max_hsps_per_subject 1 -outfmt 6 -query test.faa

but alas even though I specify -max_target_seqs 20 -max_hsps_per_subject 1 I still get 95 hits?!?

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by gislevestergaard0
1

Looks like the '-max_hsps_per_subject' option was bugged... it was replaced with '-max_hsps' in NCBI BLAST 2.2.29+, so you might want to upgrade to the current version (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by hpmcwill1.1k

Will talk to sysadm and report back, thanks again!

ADD REPLYlink written 4.5 years ago by gislevestergaard0
0
gravatar for gislevestergaard
4.5 years ago by
Germany
gislevestergaard0 wrote:

The only solution I have found is to subsequently use sort on the output file. Since the list is already sorted by query, subject and then e-val, you are able to do:

cat test.output | sort -u -k1,2

Update:

If able, use blast+ version => 2.2.29 and see accepted answer else 

this still seems to be the only solution if you are using blast or blast+ version <2.2.29

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by gislevestergaard0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1190 users visited in the last hour