Forum: Versioning of BLAST DB and Reproducible Research
3 · 3.9 years ago · makrez ▴ 50

Reproducible research demands that we can rerun any analysis and get the same results. But how should we deal with this when analyses depend on databases that are frequently updated?

We are currently facing the problem that we have implemented a daily automatic BLAST DB update, but we would like to be able to roll the database back to any given date.

I can see three ways this could be achieved:

1) In the best case, this would be part of the BLAST command-line tool suite, for example by specifying, with a --date flag, the date up to which the database should be queried. But to my knowledge, this functionality doesn't exist.

2) An easy solution would be to keep copies of the database. But this is obviously resource-intensive and doesn't seem like a good solution, especially if frequent updates occur.

3) So the only reasonable solution would be some sort of versioning of the database. This would of course mean that if you want to rerun an analysis, you would first have to restore the database to a certain roll-back point and run the analysis from there.

Do you have experience with this, or any suggestions for tools that could provide a sensible solution for versioning a BLAST database? Or am I missing something and this functionality already exists?

Any input or discussion would be appreciated.

database blast reproducible-research • 1.5k views
3

makrez: This has come up in the past (NCBI was considering keeping some archival versions available last year; this is probably on the back burner now because of SARS-CoV-2). What is the driving use case for this? GenBank is archival, so you are more than likely to get a similar answer (plus new sequences that may have appeared).

Because of the size of the NCBI databases (nt and nr), this is at best impractical unless your institution has deep pockets to implement a local solution, either via storage or at the database level.

Edit: As @leipzig points out, this question may be purely about a custom internal BLAST database.

0

You're absolutely right, a bit of a tricky situation indeed. But nonetheless, happy to hear you care about reproducible science!

What DB are we talking about here, btw? Something custom, or something like nr or nt?

1

When did the OP mention nt or nr?

1

I was writing the post with regard to the nt database.

1

In that case you could consider implementing the solution @Leipzig proposed below. Be aware that nt (77 GB; 73 GB compressed) is a large file. It can also have multiple FASTA identifiers pointing to identical sequences. Would be interesting to know if you can make the solution below work.

0

Good point. Or even databases from NCBI for that matter. I was assuming that is what they want. Apologies for that.

0

Indeed, the OP did not (yet). I alluded to it, but no reply on it (yet).

4 · 3.9 years ago · leipzig

I think the problem here is that the BLAST database file is monolithic, which makes daily versioning too disk-space-intensive. The good news is that the FASTA files indexed by BLAST are "atomic". So all you really need is a reliable recipe to recreate a database from a subset of date-stamped FASTA files.

I believe that if you git tag your Git-versioned FASTA directory and git add files daily, it will still not take up an inordinate amount of space. Then you can simply write a script to pull that version and build the database as needed.
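
A minimal sketch of that recipe, assuming the FASTA files live in a Git-tracked directory and makeblastdb is on the PATH (the paths, the date-based tag scheme, and the database name are hypothetical):

    import subprocess
    from datetime import date

    FASTA_DIR = "/data/blast/fasta"    # hypothetical Git-tracked FASTA directory
    DB_PREFIX = "/data/blast/db/mydb"  # hypothetical output database prefix

    def snapshot_today():
        """Commit today's FASTA files and tag the commit with the date."""
        tag = date.today().isoformat()
        # Assumes the daily update has added/changed files to commit.
        subprocess.run(["git", "-C", FASTA_DIR, "add", "-A"], check=True)
        subprocess.run(["git", "-C", FASTA_DIR, "commit", "-m", f"snapshot {tag}"], check=True)
        subprocess.run(["git", "-C", FASTA_DIR, "tag", tag], check=True)

    def rebuild_db(tag):
        """Check out a dated tag and rebuild the BLAST database from it."""
        subprocess.run(["git", "-C", FASTA_DIR, "checkout", tag], check=True)
        subprocess.run(
            ["makeblastdb", "-in", f"{FASTA_DIR}/sequences.fasta",
             "-dbtype", "nucl", "-out", DB_PREFIX],
            check=True,
        )

    # e.g. snapshot_today() from the daily update job;
    # rebuild_db("2021-03-01") to roll back to that date's tag.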

What an awesome topic!

0

Great suggestion, thanks. This is indeed a very interesting answer and seems feasible to implement. I will seriously look into this and, in case there is good progress, report back here.

3 · 3.9 years ago

As said above, good to hear you care about reproducible science!

However, I feel it's also important to be somewhat pragmatic about this. As @genomax also said above, there are often a whole bunch of technical limitations coming into play here. Along the same lines, one should also keep track of (and keep available) different releases of genome assemblies/annotations (more feasible than keeping all versions of nr, for instance, but still...).

If you are an author of a manuscript, you could (should?) mention, for instance, when you downloaded the nr version you used. That way you leave open the possibility of reproducing the analysis, though it is indeed not straightforward (it is possible to download and filter on release date per entry to mimic the version at a certain date, but again, that's advanced stuff; a sketch follows below).
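
As a sketch of that filter-by-date idea, assuming the entries are available in GenBank format: Biopython exposes each record's LOCUS date in its annotations (note this is the last-modified date, so it only approximates a historical snapshot; the cutoff and file names are hypothetical):

    from datetime import datetime
    from Bio import SeqIO  # Biopython

    CUTOFF = datetime(2020, 1, 1)  # hypothetical roll-back date

    def entries_before(path, cutoff):
        """Yield GenBank records whose LOCUS date is on or before the cutoff."""
        for record in SeqIO.parse(path, "genbank"):
            entry_date = datetime.strptime(record.annotations["date"], "%d-%b-%Y")
            if entry_date <= cutoff:
                yield record

    # Write the date-filtered subset as FASTA, ready for makeblastdb.
    SeqIO.write(entries_before("nt_subset.gb", CUTOFF), "nt_rollback.fasta", "fasta")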

If you clearly mention what, how, and when you used data or tools, you at least provide the possibility to reproduce it (in contrast to, for instance, simply mentioning "we used nr").

3

Reproducibility (of getting accurate results) should be unaffected by database releases. One may not get an identical result, but one would not get a wrong result either. Note: this holds as long as the database versions are not years/decades apart.

0

Indeed. I'd be worried if a result held up only for a limited number of database versions.

0

I partly agree, in the sense that the outcomes should not change dramatically just because of the BLAST database release. However, from the computer-science point of view, these analyses are deterministic and therefore the results should be identical. I am also thinking about unit testing in larger pipelines (which would be another great topic to discuss). There, the outcomes need to be identical in order to write stable unit tests.
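
One pragmatic way to make such tests stable is to pin the exact database a test expects by checksumming its files, so the test fails loudly when the database changes underneath it. A minimal sketch (the path, database name, and expected digest are hypothetical):

    import hashlib
    from pathlib import Path

    EXPECTED_SHA256 = "<digest recorded when the test was written>"  # hypothetical

    def db_digest(db_dir, name="mydb"):
        """Hash all volume files of a BLAST database in a stable order."""
        h = hashlib.sha256()
        for f in sorted(Path(db_dir).glob(f"{name}.*")):
            h.update(f.read_bytes())
        return h.hexdigest()

    def test_database_is_pinned():
        # Fails whenever the database differs from the pinned snapshot.
        assert db_digest("/data/blast/db") == EXPECTED_SHA256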

1

This is the difference between reproducibility and replicability. Interested people can read the recent National Academies report Reproducibility and Replicability in Science or, for a quicker read, Reproducibility vs. Replicability: A Brief History of a Confused Terminology.

0

In that case you should of course not only keep track of the DB itself but also of the BLAST binaries (much more feasible than the DBs themselves, though).
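
Recording the binary version alongside each run is cheap, e.g. (a sketch, assuming blastn is on the PATH; BLAST+ programs accept -version):

    import subprocess

    # Capture the tool version (printed like "blastn: 2.10.1+") and
    # log it with every analysis run for provenance.
    version = subprocess.run(
        ["blastn", "-version"], capture_output=True, text=True, check=True
    ).stdout.strip()
    print(version)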

0

> If you clearly mention what, how, and when you used data or tools, you at least provide the possibility to reproduce it (in contrast to, for instance, simply mentioning "we used nr").

Some authors preserve the exact database they used to create the manuscript's results and make it available for download. Along with the program and/or package versions used, that should be adequate for any future reproducibility or comparative studies.

2 · 3.9 years ago · Mensur Dlakic ★ 27k

It depends on whether you want reproducibility for the sake of principle, or whether you have a particular goal in mind. Sequence names and IDs don't change, so that part is reproducible. BLAST scoring shouldn't change much either, and where it does, that is an issue of changing program versions rather than of updating databases. E-values will continuously change because they are database-length-dependent. There is a discussion here of how to keep E-values consistent by always specifying the same database size (a sketch follows below). The only thing that will not be consistent over time is that more sequences will be detected, but isn't identifying new sequences the reason why you are performing daily database updates?
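
To illustrate the fixed-size trick: BLAST+ search programs accept a -dbsize argument that fixes the effective database length used in the E-value calculation, so pinning it keeps E-values comparable across database updates. A sketch (the query, database path, and chosen size are hypothetical):

    import subprocess

    EFFECTIVE_DBSIZE = "50000000000"  # hypothetical fixed effective length

    subprocess.run(
        ["blastn",
         "-query", "query.fasta",
         "-db", "/data/blast/db/mydb",
         "-dbsize", EFFECTIVE_DBSIZE,  # pin effective length so E-values stay comparable
         "-outfmt", "6",               # tabular output
         "-out", "hits.tsv"],
        check=True,
    )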

As to the resource-intensive nature of these changes, it depends on how frequently you are making them. Daily would be too excessive for my needs, but I don't know the nature of your research. If you can dial it down to weekly or monthly updates, it should not be a problem to keep physical copies of all database versions; disk storage is cheap.

Finally, lowering the redundancy cutoff from 100% to 90% has a huge effect on database size and on subsequent search times, while having a negligible effect on homolog detection, unless you really need to identify all sequences. For example, compressed UniProt at 100% redundancy is 52.3 GB, while removing redundancy at 90% cuts it down to 24.1 GB (~42 GB uncompressed); see here for details.
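
One common tool for this kind of redundancy reduction is CD-HIT (named here as one option among several; MMseqs2 would also work). A sketch of clustering proteins at 90% identity, with hypothetical file names:

    import subprocess

    subprocess.run(
        ["cd-hit",
         "-i", "uniprot.fasta",  # hypothetical input FASTA
         "-o", "uniprot90",      # representatives + a .clstr cluster file
         "-c", "0.9",            # sequence identity threshold
         "-n", "5"],             # word length recommended for thresholds >= 0.7
        check=True,
    )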

1

> Sequence names and IDs don't change

That's true only to some extent. Names used to change quite often; they may change less frequently now, but can still be ambiguous. IDs may be more stable, but would still need a database version to be completely unambiguous. Some databases consider that a minor change doesn't warrant a new ID.
