enable the efficient processing of large data sets, researchers
frequently rely on shortcuts aimed at reducing the
number of BLAST results that need to be processed. A
common strategy involves using the "-max_target_seqs"
parameter of the NCBI BLAST+
suite. According to the BLAST documentation itself
(2008-), this parameter represents the "number of aligned
sequences to keep". This statement is commonly interpreted
as meaning that BLAST will return the top N database
hits for a sequence query if the value of
max_target_seqs is set to N. For example, in a recent
article (Wang, et al., 2016) the authors explicitly state
"Setting “max target seqs” as “1,” only the best match
result was considered."
To our surprise, we have recently discovered that
this intuition is incorrect. Instead, BLAST returns the first
N hits that pass the specified E-value threshold, which
may or may not be the N highest-scoring hits. The invocation
using the parameter "-max_target_seqs 1"
simply returns the first good hit found in the database, not
the best hit as one would assume.
Worse yet, the output
produced depends on the order in which the sequences
occur in the database. For the same query, different results
will be returned by BLAST when using different
versions of the database even if all versions contain the
same best hit for this query sequence.
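
The difference between the two interpretations can be illustrated with a small toy model. The sketch below is not BLAST's actual implementation; it only mimics the behavior described above, under the assumption that hits are examined in database order and that a hit passes when its E-value is at or below the cutoff. The function names (first_n_hits, best_n_hits), the sequence identifiers, the scores, and the E-values are all hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# A hit is modeled as (subject id, bit score, E-value). Hypothetical toy data.
Hit = Tuple[str, float, float]


def first_n_hits(db_hits: List[Hit], n: int, evalue_threshold: float) -> List[Hit]:
    """Keep the first n hits, in database order, that pass the E-value cutoff.

    This mimics the behavior described in the text, not BLAST's real code.
    """
    kept: List[Hit] = []
    for hit in db_hits:
        if hit[2] <= evalue_threshold:
            kept.append(hit)
            if len(kept) == n:
                break
    return kept


def best_n_hits(db_hits: List[Hit], n: int, evalue_threshold: float) -> List[Hit]:
    """Keep the n highest-scoring hits among those that pass the cutoff.

    This is the behavior most users assume.
    """
    passing = [hit for hit in db_hits if hit[2] <= evalue_threshold]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:n]


# Two "versions" of the same database: identical sequences, different order.
db_v1: List[Hit] = [
    ("seqA", 180.0, 1e-40),
    ("seqB", 250.0, 1e-60),  # the true best hit
    ("seqC", 40.0, 0.5),     # fails the cutoff used below
]
db_v2: List[Hit] = [db_v1[1], db_v1[0], db_v1[2]]

for name, db in (("db_v1", db_v1), ("db_v2", db_v2)):
    first = [hit[0] for hit in first_n_hits(db, 1, 1e-5)]
    best = [hit[0] for hit in best_n_hits(db, 1, 1e-5)]
    print(name, "first-N:", first, "best-N:", best)
# db_v1 first-N: ['seqA'] best-N: ['seqB']
# db_v2 first-N: ['seqB'] best-N: ['seqB']
```

In this toy model the "first N" selection changes with the order of the database, whereas the "best N" selection does not, which is the discrepancy described above.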