(BLAST) Why different e-values if I query a sequence alone, or together with other sequences?
19 months ago
johan ▴ 110

Hi, I need to create a table with how e-values are distributed for some sequences, as a way of reporting how conserved the sequences are.

I got some inconsistent results and narrowed the cause down to whether I query a sequence alone or together with other sequences. The result page has a drop-down menu where you can pick only a single query sequence, so I assumed that the results for each query are independent of the other query sequences?

Here is an example. In the first test, I query ">1" alone, and the top hits have an E-value of 4e-9. In the second test, I query ">1" together with ">0" and ">2", and when I look at only ">1", the top hits have an E-value of 3e-9.

First test set-up: https://i.imgur.com/E3xJqOj.png

First test results: https://i.imgur.com/AOccFBj.png

Second test set-up: https://imgur.com/uYYnnkC

Second test results: https://imgur.com/v3SJUZ7

I just ran the same test with other sequences, and I got either 0.35 or 0.1 for the top hits. All settings are identical between the two searches: I go to nucleotide BLAST, enter my queries, enter an organism, switch the program to "blastn", and set the number of hits to 20,000. All other settings are the defaults.

So what is the correct way of doing this search? I'm so confused at the moment :<

BLAST e-value • 792 views
18 months ago
johan ▴ 110

Update: the NCBI help desk responded quickly, was able to replicate the bug, and fixed it shortly afterwards.

For others who may have performed similar BLAST searches, I paste their response below.

The developers have addressed this issue.

In summary:

• The problem only occurred for BLASTN/megaBLAST searches.

• It only happened if multiple queries were submitted at once. Results for the first query would be correct, but all other searches would use the search space for the first query instead of for each individual query.

• It only affected the web page. Stand-alone BLAST+ does not have this issue.

• Also are you aware of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6662297/ This does have implications for E-values in some situations.
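To see why using the wrong search space shifts every E-value, the bit-score form of the Karlin-Altschul relation, E = m * n * 2^(-S'), is enough (m is the query length, n the database length, S' the bit score). The following is only a rough sketch, not NCBI's code: it ignores the effective-length adjustment that BLAST applies, and the lengths and bit score are made-up values chosen for illustration.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Bit-score form of the Karlin-Altschul relation: E = m * n * 2**(-S').

    Simplification (assumption): no effective-length correction is applied.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers, not taken from the searches in this thread.
DB_LEN = 1_000_000_000      # total database length in nucleotides
BIT_SCORE = 60.0            # bit score of one particular hit
OWN_QUERY_LEN = 80          # length of the query we care about
FIRST_QUERY_LEN = 160       # length of the first query in the batch

# Correct: the search space is built from the query's own length.
correct = evalue_from_bit_score(BIT_SCORE, OWN_QUERY_LEN, DB_LEN)

# Bug described above: the first query's search space is reused.
buggy = evalue_from_bit_score(BIT_SCORE, FIRST_QUERY_LEN, DB_LEN)

# The same hit with the same bit score gets an E-value scaled by the
# ratio of the two query lengths.
fold_change = buggy / correct
print(f"correct={correct:.3g} buggy={buggy:.3g} fold_change={fold_change:.3g}")
```

Under this simplification, the alignment and its bit score are unchanged; only the E-value is rescaled, which would fit the pattern of identical hits with shifted E-values.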


Thanks for coming back and posting the solution.

I wonder how many BLAST searches similar to yours were affected over time. At least in your case, the BLAST hits themselves were not wrong. E-values change over time anyway, since the underlying database probably changes every day.

19 months ago

This sounds like the problem caused by the max_target_seqs parameter ("Max target sequences" in the web interface), described in the post "Misunderstood parameter of NCBI BLAST".

Here is a direct link to the publication:

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty833/5106166?redirectedFrom=fulltext


Thanks for the comment. I wasn't aware of this problem. However, I really don't think this is the issue here.

Here is another search:

In the first search, my only query sequence is:

>my_sequence
gacttatcAAAactggcaGGGGGccactgCCCacaggattagcaCCCCCgaggtatgtaATATATATctacagagttcttga

In the second search, my query sequences are:

>another_sequence
tttccccctggaagctcccAcgtgcgctcGAGAGAGcgaccctgccgcttaccggatacctgtccgcctttctccctt
cgggaagcgtggcgctttctcatagctcacgctgtaggtGGGtcagttcggtgtaggtcgTATATATATcaagctgggctg
>my_sequence
gacttatcAAAactggcaGGGGGccactgCCCacaggattagcaCCCCCgaggtatgtaATATATATctacagagttcttga
>yet_another_sequence
agtggtggcctaactacggctGGGtagaagaacagtatttggtatctgcgctctgGGGgaagccagttaccttcggaaa
aagagttggtagctcttgatccTTTaaacaaaccaccgctggtagcggtggtttttATATATATagcagcagattacg

I don't change any parameters in these searches. The "max_target_sequences" should not apply here since I only get 4 hits for >my_sequence.

The results from the first search:

E-value Ident.  Accession
2e-06   78.05%  DQ977720.1
8e-04   75.61%  DQ977719.1
8e-04   75.61%  DQ977718.1
8e-04   75.61%  AB084167.1

The results from the second search:

E-value Ident.  Accession
4e-06   78.05%  DQ977720.1
0.002   75.61%  DQ977719.1
0.002   75.61%  DQ977718.1
0.002   75.61%  AB084167.1

As you can see, the E-values for the same hits differ between the two searches, by roughly a factor of two :<


If I run this search locally using BLAST+ v2.10 against the nt database, I see no difference in the results for >my_sequence whether it is queried alone or in combination with multiple other sequences.

NCBI does things differently with the web interface and this may simply be a result of that. Send a ticket in to NCBI help desk if you want to understand why this is happening.
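For anyone who wants to repeat this kind of check, one possible approach is to run blastn twice with tabular output (for example `blastn -query alone.fasta -db nt -outfmt 6 -out alone.tsv`, and the same for the combined file) and then compare E-values per query/subject pair. The sketch below only does the comparison step. The file names and the function names are hypothetical, and it assumes the default 12-column `-outfmt 6` layout, where column 11 is the E-value.

```python
import math


def parse_outfmt6(text):
    """Map (qseqid, sseqid) -> lowest E-value from default -outfmt 6 text."""
    best = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        key = (fields[0], fields[1])      # qseqid, sseqid
        evalue = float(fields[10])        # 11th column is the E-value
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return best


def evalue_mismatches(alone_text, together_text, query_id, rel_tol=1e-6):
    """List (subject, evalue_alone, evalue_together) for hits that disagree."""
    alone = parse_outfmt6(alone_text)
    together = parse_outfmt6(together_text)
    mismatches = []
    for (query, subject), evalue in sorted(alone.items()):
        if query != query_id:
            continue
        other = together.get((query, subject))
        if other is not None and not math.isclose(evalue, other, rel_tol=rel_tol):
            mismatches.append((subject, evalue, other))
    return mismatches


if __name__ == "__main__":
    # Hypothetical file names produced by the two blastn runs.
    with open("alone.tsv") as a, open("together.tsv") as b:
        for row in evalue_mismatches(a.read(), b.read(), "my_sequence"):
            print(*row, sep="\t")
```

An empty result would mean the shared hits carry the same E-values in both runs, which is the behaviour reported above for the stand-alone search.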


Thanks. I've sent a ticket to NCBI help desk.


You are correct. The problem I described should not actually modify the E-values; it only changes which hit is reported as the best hit. For the same hit, the E-value should be the same.

I ran a few tests, and I can reproduce the same inconsistency even with just two sequences: one can trigger it merely by listing one or the other sequence first. All E-values are affected for both sequences. This is most unexpected and perhaps incorrect behavior.

As genomax says, I would send an email to the NCBI help desk. Make two files that differ only in which sequence is listed first, to help the process move faster. If you do, perhaps you could also let us know what they say.
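A small script can produce the two order-swapped files for such a report. This is just a sketch: the function names, the file-name pattern, and the placeholder sequences are made up for illustration, and the real query sequences would be substituted in.

```python
from pathlib import Path


def write_fasta(path, records):
    """Write (name, sequence) pairs to a FASTA file in the given order."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")


def write_order_swapped_pair(records, out_dir, stem="queries"):
    """Write the same records twice: once as given, once in reversed order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    forward = out_dir / f"{stem}_forward.fasta"
    reverse = out_dir / f"{stem}_reversed.fasta"
    write_fasta(forward, records)
    write_fasta(reverse, list(reversed(records)))
    return forward, reverse


if __name__ == "__main__":
    # Placeholder sequences (hypothetical); replace with the real queries.
    example = [("seq_a", "ACGTACGTACGT"), ("seq_b", "GGCCTTAAGGCC")]
    print(write_order_swapped_pair(example, "blast_bug_report"))
```

Submitting both files with identical settings then isolates query order as the only difference between the two searches.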


Thanks. I've sent a ticket to NCBI help desk.
