Question

Old BLAST data

1

9.5 years ago

Aerval ▴ 290

Hi,

I am currently developing a tool that compares the result of one BLAST search against different versions of a database to explain why some conclusion/finding comes up at one point but not at another.

For this I obviously need older BLAST results. Does anyone know of a place (or have some) where I can find older BLAST search result XML files, or older versions of usable nucleotide or protein databases?

I am not sure whether one can reconstruct an earlier state of the current NCBI databases, but if that is possible I would be grateful for any suggestions on that too.

Thank you

db xml blast • 2.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Aerval ▴ 290

Ram · Answer 1 · 2014-11-05

2

9.5 years ago

onuralp ▴ 190

I'm afraid the answer is no. However, here is a neat idea to get around it: "Are old versions of NCBI's nr stored somewhere?"

Not sure what exactly you have in mind, but I have recently come across an interesting paper documenting phylostratigraphic bias (which may be good to keep in mind).

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by onuralp ▴ 190

Ram · Answer 2 · 2014-11-05

1

9.5 years ago

pld 5.1k

One problem you may face is when sequences/genomes are updated. I'm not sure if you'll be able to find old versions of sequences.

What exactly are you interested in measuring? Are you interested in how the performance of BLAST has varied over time? Are you interested in how the growth of the available sequence data over time has affected BLAST results? Or are you interested in how the search space affects the performance of BLAST?

If you're interested in how search space impacts BLAST performance, why not generate BLAST databases from random subsets of the current nr database (using the alias tool, as Istvan suggested)? You could then add one or more known true positives and measure performance. There are lots of strategies for this, so it depends on what you're looking for.
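To make the subsetting idea concrete, here is a minimal Python sketch. It assumes you already have the database's sequence identifiers as a list; the function name, the `fraction` parameter and the example IDs are all made up for illustration, and the actual database building with the alias tool is left out.

```python
import random


def sample_subset(all_ids, fraction, true_positives, seed=0):
    """Draw a random fraction of the IDs and add the known true positives.

    all_ids        -- list of sequence identifiers in the full database
    fraction       -- share of the database to keep, between 0 and 1
    true_positives -- IDs that must be present so recovery can be measured
    seed           -- fixed seed so the same subset can be regenerated
    """
    rng = random.Random(seed)
    sample_size = round(len(all_ids) * fraction)
    chosen = set(rng.sample(all_ids, sample_size))
    # Spike in the sequences we expect BLAST to find.
    chosen.update(true_positives)
    return sorted(chosen)


if __name__ == "__main__":
    ids = [f"seq{i}" for i in range(1000)]  # placeholder identifiers
    subset = sample_subset(ids, 0.1, ["known_hit_1"])
    print(len(subset), "identifiers in the subset")
```

Repeating this with different fractions and seeds would give a series of databases of different sizes that all contain the same true positives.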

Out of curiosity, what do you mean by some conclusions appearing at some times but not at others? Can you give an example? It is possible that BLAST may miss something, but that is very unlikely. Are you sure that it isn't just the fact that one result may have been added or removed from nr at that time? E.g. hit XXX didn't show up before 2009 because XXX wasn't in nr before then.

Another option might be to use old versions of Ensembl. I'm not sure that it is as well curated as nr, but old versions of Ensembl are available.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by pld 5.1k

0

Your fourth paragraph describes it very well:

We are interested in the effects of small changes in the database, such as description changes or updated or retracted sequences, that lead to a different BLAST result (or an apparently different one, as when only the description or even the E-value changes), so that we can determine why a particular conclusion could be drawn from one result but not from another. In the end this should allow us to define which hits are really new, which are just updated or modified, and which results from the previous BLAST search should be treated with caution because they have been retracted. This can mostly be done by checking the submission date, but we also wanted to include those results that would not show up via submission date alone.

Edit: Embl databases might be a good idea.
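As a rough illustration of that comparison, the following Python sketch sorts the hits of two BLAST XML reports into categories. It assumes the classic BLAST XML layout (`Hit`, `Hit_accession`, `Hit_def`, `Hsp_evalue`) and a single query per report; the function names and category labels are invented here, not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def load_hits(xml_text):
    """Map accession -> (definition, best E-value) for every Hit in a report."""
    hits = {}
    for hit in ET.fromstring(xml_text).iter("Hit"):
        accession = hit.findtext("Hit_accession")
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        hits[accession] = (hit.findtext("Hit_def"), min(evalues, default=None))
    return hits


def classify(old, new):
    """Compare two accession -> (definition, E-value) maps."""
    report = {
        "new": [],                  # only in the newer result
        "missing": [],              # only in the older result
        "description_changed": [],  # same accession, different definition
        "evalue_changed": [],       # same accession, different best E-value
    }
    for accession in sorted(set(old) | set(new)):
        if accession not in old:
            report["new"].append(accession)
        elif accession not in new:
            report["missing"].append(accession)
        else:
            if old[accession][0] != new[accession][0]:
                report["description_changed"].append(accession)
            if old[accession][1] != new[accession][1]:
                report["evalue_changed"].append(accession)
    return report
```

A hit listed as "missing" could have been retracted, replaced under a new accession, or simply pushed past the E-value threshold, so this sketch alone cannot tell those cases apart; that would need a lookup against the database itself.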

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Aerval ▴ 290

0

What I forgot: as the database becomes bigger, the E-values rise, so a particular hit might be skipped because of a custom threshold. We wanted to check that such results are still available and not retracted, just less significant.
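The effect can be sketched with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw alignment score. The Python below is only an illustration: the default lambda and K are values commonly quoted for gapped BLOSUM62 searches and are used here as assumptions, the threshold is arbitrary, and real BLAST works with effective lengths rather than the raw ones used here.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults, not values read from a real report.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def still_reported(raw_score, query_len, db_len, threshold=1e-5):
    """True if the hit would pass a user-chosen E-value cutoff."""
    return expected_hits(raw_score, query_len, db_len) <= threshold


if __name__ == "__main__":
    # Same alignment score, same query, growing database.
    for db_len in (1e7, 1e8, 1e9):
        e = expected_hits(120, 300, db_len)
        print(f"db_len={db_len:.0e}  E={e:.2e}  reported={still_reported(120, 300, db_len)}")
```

Because E is linear in the database length, an unchanged alignment can drop out of a filtered result purely through database growth, which is the case that needs to be told apart from a retraction.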

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Aerval ▴ 290

1

This is an interesting idea. Please keep us posted on what you find out. It is possible that some types of sequence-based analyses and inferences (e.g., phylostratigraphy) are more prone to bias from the varying threshold than others (arguably those using, say, bi-directional best hit approaches).

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by onuralp ▴ 190

0

What exactly do you mean by the description? Do you mean that the resulting alignment of the query and subject changes, or do you mean the description entries for the RefSeq item itself?

You may be able to save some time by looking at how the methods used to calculate the lambda and K parameters of the E-value equation are affected by this. I seem to remember some BLAST parameters being dependent on the current server load, but I can't recall the details.

ADD REPLY • link 9.5 years ago by pld 5.1k

0

By description I meant the sequence definition that can be updated (like 'Escherichia ssp. Protein X' becoming 'E. coli Cytochrome C 13b') with a change in the gene ID or any other notable changes to the sequence.

That the E-value changes with the size of the db etc. is expected.

ADD REPLY • link 9.5 years ago by Aerval ▴ 290

Ram · Answer 3 · 2014-11-05

0

9.5 years ago

Istvan Albert 100k

Another possibility would be to filter gene IDs by their dates of submission or publication, then use blastdb_aliastool to create a subset of the data that reflects the state of the BLAST database at a different point in time.
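The filtering step might look roughly like this in Python. It assumes you can obtain (identifier, date) pairs from somewhere, for example a table exported from the sequence records; the record layout, function names and file name are hypothetical. The resulting ID list file would then be handed to the alias tool.

```python
from datetime import date


def ids_as_of(records, cutoff):
    """Keep identifiers whose date is on or before the cutoff, in input order.

    records -- iterable of (identifier, datetime.date) pairs
    cutoff  -- datetime.date marking the point in time to reconstruct
    """
    return [seq_id for seq_id, submitted in records if submitted <= cutoff]


def write_id_list(ids, path):
    """Write one identifier per line, ready to be used as an ID list file."""
    with open(path, "w") as handle:
        for seq_id in ids:
            handle.write(seq_id + "\n")


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        ("ID_0001", date(2008, 3, 1)),
        ("ID_0002", date(2009, 7, 15)),
        ("ID_0003", date(2012, 1, 20)),
    ]
    write_id_list(ids_as_of(records, date(2010, 1, 1)), "ids_2010.txt")
```

Note that this only reproduces which records existed at the cutoff, not what their sequences or descriptions looked like at that time.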

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

0

That is possible, but we also wanted to include changes that are caused by new submissions, like ID updates, deletions etc.

ADD REPLY • link 9.5 years ago by Aerval ▴ 290