Forum: Versioning of BLAST DB and Reproducible Research
makrez (Bern) wrote, 5 months ago:

Reproducible research demands that we can rerun any analysis and get the same results. But how should we deal with this when analyses depend on databases that are frequently updated?

We are currently facing the problem that we have implemented a daily automatic BLAST db update, but we would like to be able to roll-back the database to any given date.

I can see three ways this could be achieved:

1) In the best case, this would be part of the BLAST command-line tool suite, for example via a --date flag specifying up to which date the database should be queried. But to my knowledge, this functionality doesn't exist.

2) An easy solution would be to keep copies of the database. But this is obviously resource-intensive and doesn't seem like a good solution, especially with frequent updates.

3) So the only reasonable solution would be some sort of versioning of the database. This would of course mean that, to rerun an analysis, you would first have to restore the database to a certain roll-back point and run the analysis from there.

Do you have experience with this or any suggestions of tools which could provide a sensible solution for versioning the BLAST database? Or am I missing something and this functionality already exists?

Any input or discussion would be appreciated.

modified 5 months ago by Mensur Dlakic • written 5 months ago by makrez

makrez: This has come up in the past (NCBI was considering keeping some archival versions available last year; this is probably on the back burner now because of SARS-CoV-2). What is the driving use case for this? GenBank is archival, so you are more than likely to get a similar answer (plus new sequences that may have appeared).

Because of the size of the NCBI databases (nt and nr), this is at best impractical unless your institution has deep pockets to implement a local solution, either via storage or at the database level.

Edit: As @leipzig points out this question may be purely about a custom internal blast database.

modified 5 months ago • written 5 months ago by genomax

You're absolutely right; a bit of a tricky situation indeed. But nonetheless, happy to hear you care about reproducible science!

What DB are we talking about here, by the way? Something custom, or something like nr or nt?

modified 5 months ago • written 5 months ago by lieven.sterck

When did the OP mention nt or nr?

written 5 months ago by Jeremy Leipzig

I was writing the post with regards to the nt database.

written 5 months ago by makrez

In that case you could consider implementing the solution @Leipzig proposed below. Be aware that nt is large (77 GB, or 73 GB compressed). It can also have multiple FASTA identifiers pointing to identical sequences. It would be interesting to know if you can make the solution below work.
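On the point about multiple identifiers pointing to identical sequences: a minimal sketch of collapsing such exact duplicates before building a custom database (the record format and function name here are my own; dedicated tools such as seqkit are the usual choice in practice):

```python
import hashlib

def dedup_fasta(records):
    """Collapse records whose sequences are byte-identical.

    `records` is an iterable of (header, sequence) pairs; the first
    header seen for each unique sequence is kept, and later headers
    for the same sequence are recorded as aliases.
    """
    seen = {}      # sha256 digest -> canonical header
    aliases = {}   # canonical header -> list of duplicate headers
    unique = []
    for header, seq in records:
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest in seen:
            aliases[seen[digest]].append(header)
        else:
            seen[digest] = header
            aliases[header] = []
            unique.append((header, seq))
    return unique, aliases
```

Keeping the alias table around means hits against the canonical entry can still be mapped back to every identifier that carried that sequence.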

modified 5 months ago • written 5 months ago by genomax

Good point. Or even databases from NCBI, for that matter. I was assuming that is what they want. Apologies for that.

modified 5 months ago • written 5 months ago by genomax

Indeed, the OP did not (yet). I alluded to it, but no reply on it yet.

modified 5 months ago • written 5 months ago by lieven.sterck
Jeremy Leipzig (Philadelphia, PA) wrote, 5 months ago:

I think the problem here is that the BLAST database file is monolithic, which makes daily versioning too disk space intensive. The good news is that the FASTA files that are indexed by BLAST are "atomic". So all you really need is a reliable recipe to recreate a database from a subset of date-stamped FASTA files.

I believe that if you keep the FASTA directory under Git, git add the files daily, and git tag each day's state, it will still not take up an inordinate amount of space. Then you can simply write a script to pull a given version and build the database as needed.
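A rough sketch of that recipe, assuming Python and a git binary on the PATH; the tag scheme and function names are illustrative, and the final makeblastdb rebuild is left as a comment since BLAST+ may not be installed:

```python
import subprocess
from datetime import date

def git(repo, *args):
    """Run a git command inside `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout

def snapshot(repo, stamp=None):
    """Commit and tag the current state of the FASTA directory."""
    stamp = stamp or date.today().isoformat()
    git(repo, "add", "-A")
    # --allow-empty so the daily cron job succeeds even with no new sequences
    git(repo, "commit", "--allow-empty", "-m", f"snapshot {stamp}")
    git(repo, "tag", f"db-{stamp}")
    return f"db-{stamp}"

def rollback(repo, tag):
    """Restore the FASTA directory to a tagged snapshot.

    Afterwards you would rebuild the BLAST database from the restored
    files, e.g. `makeblastdb -in all.fasta -dbtype nucl` (not run here).
    """
    git(repo, "checkout", tag, "--", ".")
```

Git stores each file once per content version, so daily tags on a slowly changing FASTA directory cost little beyond the genuinely new sequences.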

What an awesome topic!

written 5 months ago by Jeremy Leipzig

Great suggestion and thanks. This is indeed a very interesting answer and seems feasible to implement. I will seriously look into this and - in case there is some good progress - report back here.

written 5 months ago by makrez
lieven.sterck (VIB, Ghent, Belgium) wrote, 5 months ago:

As said above, good to hear you care about reproducible science!

However, I feel it's also important to be somewhat pragmatic about this. As @genomax also said above, a whole bunch of technical limitations often come into play here. Along the same lines, one should also keep track of (and keep available) different releases of genome assemblies/annotations (more feasible than keeping all versions of nr, for instance, but still...).

If you are the author of a manuscript, you could (should?) mention, for instance, when you downloaded the nr version you used. That way you leave open the possibility of reproducing the analysis, though it is admittedly not straightforward (it is possible to download and filter on release date per entry to mimic the version at a certain date, but again: that's advanced stuff).

If you clearly mention what, how, and when you used data or tools, you at least provide the possibility of reproducing it (in contrast to, for instance, simply mentioning "we used nr").
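To illustrate the "filter on release date per entry" idea: plain FASTA headers carry no dates, so the sketch below assumes you have built your own accession-to-first-release-date mapping (e.g. extracted from the corresponding GenBank records); the function name and record format are hypothetical:

```python
from datetime import date

def filter_by_date(records, release_dates, cutoff):
    """Keep only records first released on or before `cutoff`.

    `records`: iterable of (accession, sequence) pairs.
    `release_dates`: dict mapping accession -> datetime.date of first
    release (this mapping must come from your own metadata; FASTA
    headers themselves carry no dates).
    Records with no known release date are dropped, to be conservative.
    """
    return [(acc, seq) for acc, seq in records
            if acc in release_dates and release_dates[acc] <= cutoff]
```

Running this over the current download with the cutoff set to your original download date approximates, but does not exactly reproduce, the database as it was then (it cannot restore entries that were since withdrawn or revised).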

modified 5 months ago • written 5 months ago by lieven.sterck

Reproducibility (of getting accurate results) should be unaffected by database releases. One may not get an identical result, but one would not get a wrong result either. Note: this holds as long as the database versions are not years or decades apart.

modified 5 months ago • written 5 months ago by genomax

Indeed. I'd be worried if a result held up only for a limited number of database versions.

written 5 months ago by Jean-Karim Heriche

I partly agree, in the sense that the outcomes should not change dramatically just because of the BLAST database release. However, from a computer science point of view, these analyses are deterministic and should therefore be identical. I am also thinking about unit testing in larger pipelines (which would be another great topic to discuss): there, the outcomes need to be identical in order to write stable unit tests.

written 5 months ago by makrez

This is the difference between reproducibility and replicability. Interested people can read the recent National Academies report Reproducibility and Replicability in Science, or, for a quicker read, Reproducibility vs. Replicability: A Brief History of a Confused Terminology.

written 5 months ago by Jean-Karim Heriche

In that case you should not only keep track of the DB itself but also of the BLAST binaries, of course (much more feasible than the DBs themselves, though).

written 5 months ago by lieven.sterck

If you clearly mention what, how, and when you used data or tools, you at least provide the possibility of reproducing it (in contrast to, for instance, simply mentioning "we used nr").

Some authors preserve the exact database they used to create manuscript results, and make it available for download. Along with program and/or package versions used, that should be adequate for any future reproducibility or comparative studies.

written 5 months ago by Mensur Dlakic
Mensur Dlakic (USA) wrote, 5 months ago:

It depends on whether you want reproducibility for the sake of principle, or whether you have a particular goal in mind. Sequence names and IDs don't change, so that part is reproducible. BLAST scoring shouldn't change much either; that could become an issue when changing program versions, rather than when updating databases. E-values will continuously change because they are database-length-dependent. There is a discussion here of how to keep E-values consistent by always specifying the same database size. The only thing that will not be consistent over time is that more sequences will be detected; but isn't identifying new sequences the reason why you are performing daily database updates?
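The database-length dependence follows from the Karlin-Altschul expectation, E = K·m·n·e^(−λS). A toy illustration of why doubling the database doubles the E-value of the same hit, and why pinning the effective size (what BLAST's -dbsize option does) keeps E-values comparable across releases; the K and λ values below are only illustrative, not BLAST's actual parameters for your scoring system:

```python
import math

def evalue(score, query_len, db_len, K=0.041, lam=0.267):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S).

    K and lam are illustrative placeholder values; real BLAST derives
    them from the scoring matrix and gap penalties in use.
    """
    return K * query_len * db_len * math.exp(-lam * score)

# The same alignment score against a database that doubled in size
# gets double the E-value:
e_old = evalue(score=100, query_len=300, db_len=50_000_000)
e_new = evalue(score=100, query_len=300, db_len=100_000_000)
# Holding db_len fixed (as -dbsize does) makes E-values from different
# database releases directly comparable.
```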

As for the resource-intensive nature of these changes, it depends on how frequently you are doing them. Daily would be excessive for my needs, but I don't know the nature of your research. If you can dial it down to weekly or monthly updates, it should not be a problem to keep physical copies of all database versions; disk storage is cheap.

Finally, lowering the redundancy cutoff from 100% to 90% has a huge effect on database size and on subsequent search times, while having a negligible effect on homolog detection, unless you really need to identify all sequences. For example, compressed UniProt at 100% redundancy is 52.3 GB, while removing redundancy at 90% cuts it down to 24.1 GB (~42 GB uncompressed); see here for details.

written 5 months ago by Mensur Dlakic

Sequence names and IDs don't change

That's true only to some extent. Names used to change quite often; perhaps less frequently now, but they can still be ambiguous. IDs may be more stable, but they would still need a database version to be completely unambiguous. Some databases consider that a minor change doesn't warrant a new ID.

modified 5 months ago • written 5 months ago by Jean-Karim Heriche
Powered by Biostar version 2.3.0