Query Vs. Target Using Blast2
13.2 years ago
Guidobot ▴ 20

I'm running blast2 locally using DNA sequences in fasta files. I have one large target and a load of smaller query sequences.

Version: 1.2.2.21.20090809-1 (blast2)

blast2 -p blastn -r 0 -q -2 -G 2 -E 2 -W 4 -e 0.1 -i queries.fasta -j target.fasta -o qt_blast.txt

When I use blast2 with individual queries (e.g. -i query_51.fasta) against the target, the alignments work as expected (multiple HSPs per query). But if all the same query sequences are in one file (e.g. -i queries.fasta), I only get a few alignment results (e.g. for the last few queries).

Why is this happening and how can I use blast2 to perform a many-queries-to-single-target alignment?

Also, if I swap the files over, i.e. multiple small targets in one file and one large query:

blast2 -p blastn -r 0 -q -2 -G 2 -E 2 -W 4 -e 0.1 -i target.fasta -j queries.fasta -o qt_blast.txt

it gives different results again: in this case, a single local alignment (the best HSP) of the (long) query to each (short) target. I'd be grateful if someone could explain why it matters which sequence is the target and which is the query [with the blastn algorithm]. Cheers.

blast target • 8.5k views

Guidobot, could you post which version of blast you are using and your blast2 command line parameters?

Version: 1.2.2.21.20090809-1 (blast2)
Command:
blast2 -p blastn -r 0 -q -2 -G 2 -E 2 -W 4 -e 0.1 -i queries.fasta -j target.fasta -o qt_blast.txt

(so more specifically I'm running blastn)

please edit your question and insert this new information there

13.2 years ago
Raquel Tobes ▴ 160

Blast2 experiments are not symmetric: the results depend on which sequence is selected as the query (-i) and which as the target (-j). Statistically, the second sequence (-j) plays the role of the database sequence, and the database size is fundamental to the statistical evaluation of a hit (the E-value).

You can find information about that in: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/bl2seq.html#5

You can set the size of the database to a constant value independent of the size of the second sequence (-j).


This has been extracted from the above URL:

Option: -d
Function: Theoretical database size
Default: 0
Example: To use a theoretical database size of 2000000, use: -d 2000000

Note: Default is to use actual size of the second query. We can use this parameter to provide the actual size of a real database such as protein nr to get a more realistic Expect value for the returned protein alignment.
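To make the dependence on database size concrete, here is a minimal sketch of the standard Karlin-Altschul expectation, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The values of K and lambda below are placeholders chosen for illustration, not the parameters BLAST would actually use for a given scoring scheme, and real BLAST also applies effective-length corrections that are ignored here.

```python
import math


def expect_value(score, query_len, db_len, k=0.1, lam=1.0):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    k and lam are placeholder values (assumptions), not BLAST's
    real statistical parameters for any particular scoring scheme.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Same alignment score, same query, two different database sizes:
# the E-value grows in direct proportion to the database length.
e_small_db = expect_value(score=30, query_len=500, db_len=2_000_000)
e_large_db = expect_value(score=30, query_len=500, db_len=20_000_000)
print(e_large_db / e_small_db)
```

This is why fixing the size with -d makes E-values comparable across runs whose second sequence differs in length.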


In addition, because BLAST relies on internal heuristics, the same BLAST experiment can yield slightly different results.

With regard to multiple query sequences (-i): using BLASTALL, we have found that there is a limit on the number of query sequences beyond which the output file is no longer complete. Our solution was to split the query into several sets (our experimental limit was around 3000 sequences per query file). In the case of BLAST2, you can write a script that runs an independent blast2 for each short sequence (-i) against the large sequence (-j).

Statistically -i is considered the query and -j the database.
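A script of the kind suggested above could look roughly like the following sketch. It splits a multi-FASTA file into one file per query and calls blast2 once for each. Only the -p, -i, -j and -o options from the command line in the question are used; the helper names (read_fasta, blast2_command, run_per_query) and the output file naming are made up for illustration, and the scoring options from the question would need to be added to the command list.

```python
import subprocess
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def blast2_command(query_path, target_path, out_path):
    """Build the argument list for one blast2 run (blastn, default scoring)."""
    return ["blast2", "-p", "blastn",
            "-i", str(query_path), "-j", str(target_path), "-o", str(out_path)]


def run_per_query(queries_fasta, target_fasta, work_dir):
    """Write each query to its own file and run blast2 on it separately."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    records = read_fasta(Path(queries_fasta).read_text())
    for n, (header, seq) in enumerate(records, start=1):
        query_path = work_dir / f"query_{n}.fasta"
        query_path.write_text(f">{header}\n{seq}\n")
        out_path = work_dir / f"query_{n}_blast.txt"
        # check=True raises if blast2 exits with a non-zero status
        subprocess.run(blast2_command(query_path, target_fasta, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

Calling run_per_query("queries.fasta", "target.fasta", "per_query") would then leave one result file per query in the per_query directory, assuming blast2 is on the PATH.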


Thank you for the info and link. This confirms the asymmetry I see but does not explain the behavioral difference. That is, given a target sequence (database) of ~2Mb and a query of 500b, I get a nice ~300b match at E=3e-74. But when I use the short sequence as the target I only get a few small matches (e.g. 11b) with E>10. I do not see why the exact same 300b match is not found, even if the E value is (largely) determined by the size of the target.

There must be some algorithmic reason why correct alignment is not found. Perhaps because the query is longer than the target?


With respect to a (hidden) maximum number of query sequences, I ran a test with only 8 queries and this did indeed run perfectly, i.e. it gave the same results as the queries run individually! However, I got a near-total failure when trying only 1654 query sequences: the same sequences that had given great hits were given 0 matches, and there was nothing to warn me that some internal limit had been exceeded. What was the average length of your 3000 queries?


About 400bp using BLASTX.

Another kind of hidden error that we have detected is related to the writing of the XML-format output file of BLAST results: in some cases a set of results is lost and is not included in the XML output. It is really important to watch for this kind of error, which you can detect by checking that the last iteration ID (Iteration_iter-num) is identical to the number of [?] elements in the XML output file.
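A check along these lines might be sketched as follows. The element name to count is not given above, so this sketch assumes it is the Iteration element that contains each Iteration_iter-num in BLAST XML output; the function name check_iterations is made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_iterations(xml_text):
    """Return (iteration_count, last_iter_num, consistent) for BLAST XML text.

    Assumption: the elements to count are <Iteration> elements, each
    holding an <Iteration_iter-num> child.
    """
    root = ET.fromstring(xml_text)
    iterations = list(root.iter("Iteration"))
    if not iterations:
        return 0, None, False
    last_text = iterations[-1].findtext("Iteration_iter-num")
    # A missing or empty iter-num is treated as an inconsistent file.
    last_num = int(last_text) if last_text and last_text.strip() else None
    return len(iterations), last_num, last_num == len(iterations)
```

For a file on disk, the text can be read first and passed in, or the same counting can be done on the root returned by ET.parse(path).getroot().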


With regard to the asymmetry: the algorithm itself implies that the results differ if you interchange target and query. The initial step of the BLAST algorithm is to build a deterministic finite state machine that searches for all words of size W scoring above a similarity threshold against any word of size W in the query (no gaps are allowed in this word-to-word alignment). These W-word matches are the seeds of the alignments that are extended in the later steps of the algorithm. This step could produce different results if you swap target and query.
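As a rough illustration of the seeding step described above, the sketch below indexes every word of size W in the query and then scans the target for matching words. It is a simplification: it uses a plain dictionary and exact matches only, rather than a finite state machine with a similarity threshold, and it does no extension. The function name find_seeds is made up for illustration.

```python
from collections import defaultdict


def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where a W-mer matches exactly."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the target once, looking each word up in the query index.
    seeds = []
    for j in range(len(target) - word_size + 1):
        for i in index.get(target[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


seeds = find_seeds("ACGTAC", "TTACGT", word_size=4)
print(seeds)
```

With a word size as small as 4, almost every position in a long target will produce seeds, which is one reason such a setting makes the search slow.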


In addition, the position of the similarity region could perhaps be more advantageous in one specific case. If the similarity region lies at the end of a sequence, it might be more easily detected when that sequence is used as the query (this is only a possibility).

13.2 years ago
lh3 33k

Raquel has answered the query vs. target part: they are treated very differently in blast. In general, the proper way in your case is to use many short sequences as queries. Nonetheless, blast1 will be extremely inefficient because it scans the long target for each query, which is very costly. I do not know if blast2 has changed this behavior.

A further question is why you use "-r 0" (a match score of zero). Actually I am surprised that blast even works in that case, because all alignment scores are non-positive. Also, a word size of 4 will make blast very inefficient. Probably fasta will work better. If you want to map short reads, you should use a dedicated short-read mapper, which works far better than blast.
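The point about "-r 0" can be seen with a small sketch of ungapped scoring, where the score is simply matches times the match reward plus mismatches times the mismatch penalty. Gap costs are left out, and the function name ungapped_score is made up for illustration.

```python
def ungapped_score(matches, mismatches, reward, penalty):
    """Raw score of an ungapped alignment: reward per match, penalty per mismatch."""
    return matches * reward + mismatches * penalty


# With -r 0 -q -2, matches add nothing, so the score can never exceed zero.
score_r0 = ungapped_score(matches=300, mismatches=5, reward=0, penalty=-2)

# With -r 1 -q -1, the same alignment gets a clearly positive score.
score_r1 = ungapped_score(matches=300, mismatches=5, reward=1, penalty=-1)

print(score_r0, score_r1)
```

With a zero reward, extending an alignment through matching bases never increases its score, so there is nothing in the score to distinguish a long match from a short one.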


I basically have relatively long reads with lots of (INDEL) errors. Hence the short (minimum) word size used. It's also why most short-read mappers are no good for this type of alignment. The particular settings for match/mismatch and gap scores were partly governed by what combinations blast2 would accept. I've looked at other (open-source) alignment tools but so far haven't found exactly what I need and I'm very open to suggestions. But I'd also like to know if/how I can use blast2 for multiple queries to a target, which I wouldn't have expected to be an issue.


Actually, -r 0 was not a good choice, as it means alignments at the ends are missed. My aim was to give gaps the same penalty as mismatches, but this does not seem entirely possible given the restrictions on the combinations of match, mismatch and gap parameters. I'm now using -r 1 -q -1 -G 1 -E 2.
