Question: (solved) I couldn't reproduce the problem of max_target_seqs
4
gravatar for fishgolden
12 months ago by
fishgolden420
fishgolden420 wrote:

UPDATE (2018/10/12) old contents are removed to make the point of this entry much clearer:

I made fasta files which reproduce the problem written in https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166 & https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .

please use

http://miscsources.sakura.ne.jp/blast_2018/target.fas

as query. The sequence is nHd.2.3.1.t00019-RA which is mentioned in the 2015 gist post https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 .

Please use

http://miscsources.sakura.ne.jp/blast_2018/dummydb.fas

http://miscsources.sakura.ne.jp/blast_2018/dummydb51.fas

http://miscsources.sakura.ne.jp/blast_2018/dummydb_mod.fas

as database.

I used makeblastdb and blastp command in blast+2.7.1 for windows 64 bit.

The databases contain 2 (KRX89030, WP_042303394 ) sequences which are family of sequences mentioned in

https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5

However,

dummydb.fas contains 52 copies of KRX89030.1 .

dummydb51.fas contains 51 copies of KRX89030.1 .

and both have one copy of WP_042303394.1 .

dummydb_mod.fas contains 52 copies of KRX9030.1 and difference with dummydb.fas will be explained later.

Why 52 copies are needed is because BLAST decides number of hit sequences to retain (after the firs step? not sure) as follows

blast_hits.c

   prelim_hitlist_size = hit_options->hitlist_size;
   if (ext_options->compositionBasedStats)
        prelim_hitlist_size = prelim_hitlist_size * 2 + 50;
   else if (scoring_options->gapped_calculation)
        prelim_hitlist_size = MIN(2 * prelim_hitlist_size,
                                  prelim_hitlist_size + 50);

   (*retval)->prelim_hitlist_size = MAX(prelim_hitlist_size, 10);

Using dummydb.fas, the problem is reproducible.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.99e-047       153
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

yey.

Using dummydb51.fas, the problem does not occur.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.84e-047       153
nHd.2.3.1.t00019-RA     seq51   63.115  122     45      0       1       122     105     226     3.59e-042       140

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb51.fas -outfmt 6 -max_target_seqs 1
nHd.2.3.1.t00019-RA     WP_042303394.1  58.678  121     49      1       4       124     93      212     7.84e-047       153

If the problem is caused by the ungapped alignment step as NCBI staff mentioned (update 2018/11/03 according to the https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ it performs "a gapped extension based on the gap free extension") , when the sequence showed higher score with ungapped alignment, the problem will not occur.

Apparently, WP_042303394.1 has lower score than KRX89030.1 with ungapped mode.

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb.fas -outfmt 6 -max_target_seqs 53 -comp_based_stats F -ungapped
...omitted...
nHd.2.3.1.t00019-RA     seq3    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     seq2    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     seq1    63.115  122     45      0       1       122     105     226     1.32e-052       194
nHd.2.3.1.t00019-RA     WP_042303394.1  60.000  105     42      0       4       108     93      197     8.05e-042       158
nHd.2.3.1.t00019-RA     WP_042303394.1  47.368  19      10      0       106     124     194     212     0.020   27.6

Then if I modify WP_042303394.1 to have higher score with ungapped mode, the problem does not occur. dummy_mod.fas contains modified WP_042303394.1, which was named A_WP_042303394.1

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2 -comp_based_stats F -ungapped
nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     3.99e-054       199
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     1.33e-052       194

C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 2
nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     1.45e-053       172
nHd.2.3.1.t00019-RA     seq52   63.115  122     45      0       1       122     105     226     3.66e-042       140

 C:\misc\gistup>C:\misc\ncbi-blast-2.7.1+\bin\blastp.exe -query target.fas -db dummydb_mod.fas -outfmt 6 -max_target_seqs 1
 nHd.2.3.1.t00019-RA     A_WP_042303394.1        63.636  121     44      0       4       124     93      213     1.45e-053       172

yey!

With these results, I don't think

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166

BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits.

Please let me know if you have any problems to reproduce my result. (Scoring system of BLAST is very complicated thus the procedure of the last part (etimating behavior using scores in ungapped mode) may not work with other sequences.)

blast • 1.4k views
ADD COMMENTlink modified 7 months ago by gb1.0k • written 12 months ago by fishgolden420
1

the paper states that the order is set by the order in the database, hence if you have the "right" order relative to the search the parameter will have no effect.

To verify what the paper states take the best hit that you have, move it last in the FASTA file, reindex with blast and test again

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 81k

I am kind of surprised that the note was accepted without details of what exactly was done, how it was done and what it was done with.

I just listened to a blast webinar at NCBI and it sounded like this was not an issue (at least in the example that was shown of a search with 1 max_target_seq and 10. Recommendation was not to use just 1 for any searches (use at least 10 and then filter).

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

IMO that recommendation is still kind of suspicious - it either works as we think it does or not - this sort of special cases are even worse (if it worked one way in some cases and the other way in some others)

that being said I cannot reproduce the behavior either:

# Get the 16S Microbial data
update_blastdb.pl --decompress 16SMicrobial

# Take the last entry in the database
blastdbcmd -db 16SMicrobial -entry all -outfmt '%a' | tail -1

# The last entry in the database is NR_158155.1

# Get the sequence for the last entry
blastdbcmd -db 16SMicrobial -entry 'NR_158155.1' > s.fa

# Get the best hit when searching for all.
blastn -db 16SMicrobial -query s.fa -outfmt 6 | head -1

# Get the best hit when limiting for 1 sequence.
blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 1 | head -1

the script prints:

NR_158155.1
NR_158155.1 NR_158155.1 100.000 1409    0   0   1   1409    1   1409    0.0 2603
NR_158155.1 NR_158155.1 100.000 1409    0   0   1   1409    1   1409    0.0 2603

so I get the same result when limiting to 1

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 81k

I get the same result with blast v. 2.7.1 on linux.

Taking a random ID from same database:

$ blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 10 | head -5
NR_158067.1 NR_158067.1 100.000 1473    0   0   1   1473    1   1473    0.0 2721
NR_158067.1 NR_113517.1 94.508  1475    76  5   1   1473    1   1472    0.0 2270
NR_158067.1 NR_113516.1 94.441  1475    77  5   1   1473    1   1472    0.0 2265
NR_158067.1 NR_102445.1 94.456  1461    76  5   15  1473    1   1458    0.0 2244
NR_158067.1 NR_136772.1 94.533  1445    74  5   2   1445    1   1441    0.0 2226

$ blastn -db 16SMicrobial -query s.fa -outfmt 6 -max_target_seqs 1 
NR_158067.1 NR_158067.1 100.000 1473    0   0   1   1473    1   1473    0.0 2721
ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

now that I read the report a little better - I want to mention that my example here may not trigger the bug. As the response from NCBI reply states

In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits.

it just might be that my example does not hit this scenario. Which only makes the behavior more confusing IMO, it seems to work until it does not.

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 81k

Ohohoh you are right, you are right. & NCBI staff is right. please use query

>nHd.2.3.1.t00019-RA
MSNLGITDPCVDAMNSLGLKLEELQDLEVDAGLGNGGLGRLAACFMDSLATLSIPAIGYGIRYEFGIFNQRVINGEQVEE
RDDWLEFGDPWEKLRQDKKISVYFNGKTYVDKEGRSHWVDTQQIVD

which is mentioned in the 2015 gist post Perform blastp search against nr using max_target_seqs 100 and 10000 (web blast is fine).

You can see apparent difference between the two.

check the Gap numbers of the entries which are only seen in max_target_seqs 10000.

Yeeeeeeey I can sleep well good night.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

fishgolden : I have done the blastp search against nr with your protein sequence with three limits 5, 100, 500. Top two hits are the same (with some differences later)

==> out.1e-5.max100.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141

==> out.1e-5.max500.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140

==> out.1e-5.max5.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141
ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

nr is updated so there are much more eukaryotic genes now compared with 2015. 500 is not enough I think.

ADD REPLYlink written 12 months ago by fishgolden420

For 5, 100, 500, 1000, 5000, 10000. Searches done locally.

==> out.1e-5.max5.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141    

==> out.1e-5.max100.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140
nHd.2.3.1.t00019-RA KRX89025.1  63.115  122 45  0   1   122 105 226 1.54e-36    140
nHd.2.3.1.t00019-RA KFD69381.1  61.983  121 46  0   1   121 466 586 2.00e-36    141

==> out.1e-5.max500.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA KRX89027.1  63.115  122 45  0   1   122 105 226 7.55e-37    140

==> out.1e-5.max1000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150

==> out.1e-5.max5000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150

==> out.1e-5.max10000.txt <==
nHd.2.3.1.t00019-RA OQV20892.1  100.000 122 0   0   1   122 1   122 5.13e-85    253
nHd.2.3.1.t00019-RA GAV08169.1  75.806  124 30  0   1   124 51  174 5.96e-45    162
nHd.2.3.1.t00019-RA WP_042303394.1  58.678  121 49  1   4   124 93  212 1.08e-40    153
nHd.2.3.1.t00019-RA WP_017775351.1  58.678  121 49  1   4   124 93  212 1.22e-40    153
nHd.2.3.1.t00019-RA WP_114812007.1  57.851  121 50  1   4   124 93  212 1.76e-39    150
ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

Sorry, I misunderstood what you said. The third or forth hits of 500 are the ones (or family of ones) which the gist author mentioned (those are from bacteria (Paraburkholderia kururiensis)). More similar eukaryotic sequences were added after the author's post.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

The statistics are so close on later hits that their order may be re-arranged a bit in the display. Top two hits are clearly different. Once we get into 1e-39 and 1e-40 territory those are very similar.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

I think the ones who address this problem as "bug" do not care about e-value differences.

They just mention the sequences exist or not.

ADD REPLYlink written 12 months ago by fishgolden420

Please perform text search for "UniRef50_A7E2S9" and "UniRef50_A0A0D9RFQ1" in this page or gist.

If the order in the database decides the order in the results, the hits appeared with "-max_target_seq 1" for ”testfas1.fas” and "testfas2.fas" must be the same.... I think.

ADD REPLYlink written 12 months ago by fishgolden420

@Istvan: I tried moving the fasta sequence around in a file, re-indexing the file. The result does not change. So blastn is not affected. Let me see if I can test with blastp.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

now that I looked at it in more detail there is more to it, the "bug" or "behavior" is triggered only in some special cases.

In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits.

You have to have an alignment that matches the statement above.

Look at the examples that Brad Chapman has on the github page and search against NR so that there is ample room for the behavior to trigger.

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 81k
1

C: Misunderstood parameter of NCBI BLAST (Response from NCBI in NAR posted by @Vijay)

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax73k

Thank you for the notification! I was very surprized that there was a bug. It made problems complicated........

ADD REPLYlink written 10 months ago by fishgolden420

Just to be certain, you are using exactly the same versions of blast/database etc as referred in the paper?

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

I couldn't find the version number in the paper. In fact, it's a letter made from 2 pages. There are not example commands and queries and databases which result in the problem. Thus I wrote "problem" not "results".

ADD REPLYlink written 12 months ago by fishgolden420

The other researcher uses Linux & 2.2.28~31 I'll check this tomorrow. 10/5 : I updated main text & uploaded the result file. https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

fishgolden : We can't see the screenshots.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k

me too. it is the post in 3 yeas ago.

ADD REPLYlink written 12 months ago by fishgolden420

I am referring to your screenshot: https://www.dropbox.com/s/8z1kso2d53k3l5k/max500.png?dl=0

ADD REPLYlink written 12 months ago by genomax73k

It come from the preview of the link of gist I pasted. Biostars changes it automatically. Sorry it is confusing. I changed.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

genomax Thank you.

ADD REPLYlink modified 12 months ago • written 12 months ago by fishgolden420

Recommendation: show the alignments as text and not images

also remove the unnecessary data listings, those make the post difficult to follow

ADD REPLYlink written 12 months ago by Istvan Albert ♦♦ 81k

Thank you for the suggestion. I'll do it when I have a time. (Unfortunately, I'm a little bit busy for the next several days.)

ADD REPLYlink written 12 months ago by fishgolden420
5
gravatar for Peter
11 months ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

It took me a while to publish this on GitHub, but I made a self-contained test case very close to the original 2015 NR example using current hits including WP_042303394.1 and KRX89030.1 but not the tardigrade sequences which were added after 2015:

I really like your approach to make test databases containing one copy of WP_042303394.1 and lots of copies of KRX89030.1.

Your database and mine both reproduce the original problem reported by Sujai in December 2015, but like you I was not able reproduce the claim from Shar et al. (2018) about the database order being important (other than in the trivial case of a tie breaker for identical scores and e-value). I've included a link to your example in the followup post:

See also this description of the BLAST algorithm and the internal limit of 2*N+50 which matches the source code you quoted, and explains why with -max_target_seqs 1 the threshold is 52 candidates, and thus your databases with 52 or 53 sequences gave different answers:

ADD COMMENTlink written 11 months ago by Peter5.8k
2

Thank you! https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ is a good explanation. It matches with what I could understand from the source codes of BLAST.

The matter I was referring is written in C4 as N_i. And what lieven.sterck referred is written in C3 as X_g. (With my understandings).

ADD REPLYlink written 11 months ago by fishgolden420
2
gravatar for lieven.sterck
12 months ago by
lieven.sterck6.1k
VIB, Ghent, Belgium
lieven.sterck6.1k wrote:

shamelessly abusing the 'answer' functionality, just to create a side issue here:

I want to point out that this kind of behaviour is not only linked to the -max_target_seq parameter but to all filtering parameters (eg. Evalue).

See this post:

https://bioinformatics.stackexchange.com/questions/3409/blast-hits-disappearing-after-changing-evalue/4146

I once read one on the Evalue specific in which they showed you can go from a prokaryotic to an eukaryotic best hit, but can't seem to find it back (will add if I do)

ADD COMMENTlink modified 12 months ago • written 12 months ago by lieven.sterck6.1k
1

Considering how widely BLAST is used and how often people filter on e-values no wonder that we have an irreproducibility problem in life sciences

ADD REPLYlink written 12 months ago by Istvan Albert ♦♦ 81k

Istvan Albert : Perhaps I am missing the point but this isn't a make or break type situation, correct? If you are blindly selecting just one hit from a blast search then either your application needs are casual or you are not doing the analysis as carefully as you should.

Based on the tests I did above there is hardly any difference in results. If you are filtering at 1e-39 vs 1e-40 there is not much to say. This is akin to setting a p-value filter in RNAseq data analysis.

Irreproducibility problem based on e-value filtering may have more to do with new sequences being added to the database all the time. We are not using a stable blast DB release like we do for genomes.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k
1

My point is that there is a responsibility that toolmakers have to properly document and make a note of situations that might lead to misunderstandings. It is insufficient to point to the documentation (fine print).

Especially in the light that this was noted in 2015, most people that can be considered experts agreed that it was wholly unreasonable and NCBI did nothing about it (or perhaps fixed their web search, hence "pushing the dirt" even further under the rug).

The problem is not as much that the parameter exists or that is called this or that - everyone makes mistakes - it is the arrogance of not doing anything about it for three years is what annoys me the most.

ADD REPLYlink modified 12 months ago • written 12 months ago by Istvan Albert ♦♦ 81k

You're absolutely correct on the DB side of this issue genomax and this will certainly add to the whole irreproducibility . But there is still on underlying issue with the blast. The main issue is that people (myself included up to a few months ago) that these kind of thresholds would only affect the 'reporting' but it turns out they are actually influencing the 'algorithm' part of blast causing the observations.

Concerning your test case: the test would be better if you used a protein sequence not (yet) present in the the database (regardless of anything, blast does, in most cases, always finds the identical sequence back). I suggest you select one from the DB, remove it then from the DB, then check the best hit and start moving that one around. But I'm also not really convinced, as stated here somewhere as well, that the issue is 'position in DB related'

ADD REPLYlink written 12 months ago by lieven.sterck6.1k
2

these kind of thresholds would only affect the 'reporting' but it turns out they are actually influencing the 'algorithm' part of blast causing the observations.

But has this produced a result that is invalid or toally inaccurate. I don't think BLAST guarantees an optimal solution every time. If you need to ensure that then Smith-Waterman should be used instead.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax73k
1

issue (where also Peter & A: (solved) I couldn't reproduce the problem of max_target_seqs is involved) described here might catalogue as such I think

ADD REPLYlink written 11 months ago by lieven.sterck6.1k

I agree with you " I don't think BLAST guarantees an optimal solution every time. If you need to ensure that then Smith-Waterman should be used instead.".

Heuristically, I cannot decide which group of sequences is wanted

  • A group of sequences which have high scores in preliminary step but lower scores in the final step
  • A group of sequences which have lower scores in preliminary step but higher scores in the final step

when both groups have very small evalue in the final step.

I'm a greedy person that I want to get both groups. But how about the person who set "-max_target_seqs 1" ...? The latter one? But the former has higher scores at the preliminary step (with the algorithm differs from that of final step) it means that the former is better than the latter in different point of view.

On the other hands, if somebody says "should be better documented", I also agree with it.

https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ is good document. I’m wondering when this document was made. Last update is May 18, 2016. My bad if it has been two years.

(edited 2018/11/03/23:11JST)

ADD REPLYlink modified 11 months ago • written 11 months ago by fishgolden420
1

I asked by email and https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ was added in October 2018 (which I mentioned when I linked to it in https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html ), but they weren't sure where the last modified date came from and why it had not updated. Note this was added after the Shah et al 2018 letter was published.

ADD REPLYlink modified 11 months ago by genomax73k • written 11 months ago by Peter5.8k

Thank you for the information! & Thank you for taking the trouble to ask the support. I have checked google chache and archive.is but couldn't confirm the date.

ADD REPLYlink written 11 months ago by fishgolden420
1

I checked too, before simply asking - the reason there is a cache on archive.org from 2 November is because I requested the snapshot ;)

ADD REPLYlink written 11 months ago by Peter5.8k

I think the e-value matter is caused in the probability distribution estimation step. (Just I think. There are no evidence.)

(update 2018/11/03)

I was 90% wrong. Some estimations (evalue to score) were done but according to the https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_ the filterings were done at the step same as the max_target_seqs.

ADD REPLYlink modified 11 months ago • written 12 months ago by fishgolden420

likely, but common interpretation will also be that this is done at the end (== the 'reporting' stage)

ADD REPLYlink written 12 months ago by lieven.sterck6.1k

Yes, e-value is also used as a filter before the final step - this was also noted in the comment from Harm Nijveen on 3 Dec 2015 on the original max_target_seqs report from Sujai (I would link to it but BioStars insists on embedding the entire gist)

ADD REPLYlink modified 11 months ago • written 11 months ago by Peter5.8k
0
gravatar for gb
7 months ago by
gb1.0k
gb1.0k wrote:

I know this tread is old and solved but wanted to add something if people run into the same "problem". If you use -perc_identity 100 it can happen that you will not get a hit at all even that the database contains the same sequence as the query.

In my case I do not have a hit at all with -max_target_seqs 1 -perc_identity 100 or even -max_target_seqs 7 -perc_identity 100. But when I set the -max_target_seqs higher then 8 blastn will find and return the right hit (and only one because there are no other perfect matches).

If I do -max_target_seqs 1 -perc_identity 99 I get the another hit which is not exactly the same as my query.

If I do -max_target_seqs 8 -perc_identity 99 I get 8 hits and none of them is exactly the same as my query. I used -max_target_seqs 8 because that was the treshold in the first testcase.

I need to do -max_target_seqs 16 -perc_identity 99 to get my exact match in the results.

EDIT:

Thanks to the comment of genomax. I used blast-2.6.0 and 2.8.1 and used a custom selection of the nt database.

ADD COMMENTlink modified 7 months ago • written 7 months ago by gb1.0k

This thread is only 5 months old.

You need to add information about version of blast+ and what databases you are using. If these observations are for a custom database then these observations may only apply to your specific case.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax73k

Yes they do only apply for me, specially these specific parameters. I wanted to point out that instead of a different result you can also get no hits at all because of the -max_target_seqs parameter. Maybe for some it is obvious but for others they can find the cause to there problem.

If it does not fit here I can remove the post ofcourse.

ADD REPLYlink written 7 months ago by gb1.0k

I made several dummy dbs and ran searches several times but I couldn't reproduce the problem.

There are many things to consider.

  • How long is the length of the query sequence?
  • Does it contain N?
  • How long is the lengths of -perc_identity 99 hits?
  • Are they masked?
  • How are the alignments with them?
  • How is the result with the option "-ungapped"?

etc.

To investigate deeper, we need reproducible examples.

ADD REPLYlink modified 6 months ago • written 6 months ago by fishgolden420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 3566 users visited in the last hour