Forum: Why does bioinformatics use a special format like a BAM file instead of a database like SQLite?

kojix2 ▴ 250 • 3.7 years ago

This is a simple question. Of course, there are a few reasons that come to mind immediately.

  • Reduce disk space
  • Improve performance
  • Feed optimization
  • Historical circumstances
  • Obfuscate and add value

The last one is a joke, but why is a special format actually used instead of a common and popular format like SQLite?

sam bam samtools

h.mon 35k • 3.7 years ago

The paper The Sequence Alignment/Map format and SAMtools and the blog post SAM/BAM/samtools is 10 years old provide a lot of information on why and how the SAM format came about. It seems the main drive was to provide a unified format capable of pleasing everyone, so that it would be adopted by all developers of the (at the time) short-read mappers - and even then it already supported long reads (up to 128 Mbp) from the very first releases. This meant the format had to be compact and fast, and had to store a lot of metadata in addition to recording rich alignment information. It also had to be flexible and stable.

The point is that, because it was designed specifically to store sequence alignment data, it does a better job of it than any generic format adapted to do the same job.

As genomax said, SQLite is a small, self-contained implementation of SQL (a query language), not a format. And relational databases come in several on-disk formats, not just one.

cmdcolin ★ 3.8k • 3.7 years ago

If none of those arguments are convincing, BAM (and CRAM) files also have the nice property of being remotely accessible in slices (queries over genomic ranges) via HTTP, with zero SQL server setup and no REST API. This makes it much easier for users to access their alignment data remotely over FTP or HTTP: they can run samtools view http://mysite.org/myfile.bam chr1:100-2000, or point their genome browser at a remote file, and it just works; no REST API is needed, which any SQL-based setup would require. This is enabled by the BAM index and by the block-based structure of BAM compression (BGZF, block-based gzip).
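As a rough sketch, assuming pysam is available (it wraps htslib, so it understands BGZF and the BAM index), the same range query from a script would look something like this, reusing the placeholder URL from the command above:

    import pysam

    # htslib resolves the query through the index and only downloads the
    # BGZF blocks covering the requested region, via HTTP range requests.
    bam = pysam.AlignmentFile("http://mysite.org/myfile.bam", "rb")
    for read in bam.fetch("chr1", 100, 2000):
        print(read.query_name, read.reference_start)
    bam.close()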

Note that there is also the SAM text representation, and BAM is just a binary encoding of it. You could store a SAM file in a SQLite database if you wanted to, but you would need a good index. This project came across my feed and does apply genomic-range indexes to SQLite: https://github.com/mlin/GenomicSQLite. Further, things like htsget have put BAM data behind a REST API, because there are indeed possible benefits. Just food for thought.
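For illustration, a naive "SAM in SQLite" layout might look something like the sketch below; the table and column names are invented, and a plain (rname, pos) index is only a rough stand-in for the genomic interval indexes GenomicSQLite provides:

    import sqlite3

    con = sqlite3.connect("alignments.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS alignment (
            qname TEXT, flag INTEGER, rname TEXT, pos INTEGER,
            mapq INTEGER, cigar TEXT, seq TEXT, qual TEXT
        )
    """)
    # A plain (rname, pos) index gives fast lookups by start coordinate...
    con.execute("CREATE INDEX IF NOT EXISTS aln_pos ON alignment (rname, pos)")

    # ...but a naive range query like this misses reads that start before the
    # window yet overlap it, which is exactly the gap an interval index closes.
    rows = con.execute(
        "SELECT qname, pos FROM alignment WHERE rname = ? AND pos BETWEEN ? AND ?",
        ("chr1", 100, 2000),
    ).fetchall()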

GenoMax 141k • 3.7 years ago
  • Portability is important, since biologists are the end users of these files.
  • Ability to store metadata about the data in the file.
  • Historical reasons.

SQLite is a relational database management system, not a format, correct?

Reply:

TBH, I think metadata is better supported by a database schema than by a goofy header or by being crammed into the 9th column.

Reply:

Do DBMSs allow column descriptions to be stored in information-schema tables?

EDIT: Looks like MySQL allows for <colname> <DATATYPE> COMMENT 'text goes here'. That might be useful.

Answer • 3.7 years ago

While BAM files might look like a delimited table, that is not actually the case. BAM is not just a compression of the SAM format, but an entirely different binary format in its own right, which makes it very efficient for storage and lookup.

For example, I loaded the SAM representation of a 44 MB BAM file into an SQLite database as a straightforward single table, and the resulting database file was 474 MB.
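To make "an entirely different binary format" concrete, here is a small sketch of just one of BAM's tricks: the 4-bit base encoding from the SAM/BAM specification, which packs two bases into each byte before BGZF compression is even applied:

    # The spec's 16-character base alphabet; a base's index is its 4-bit code.
    NT16 = "=ACMGRSVTWYHKDBN"
    CODE = {base: i for i, base in enumerate(NT16)}

    def pack_seq(seq: str) -> bytes:
        """Pack a read sequence into nibbles, the way a BAM record stores SEQ."""
        out = bytearray()
        for i in range(0, len(seq), 2):
            hi = CODE[seq[i]] << 4
            lo = CODE[seq[i + 1]] if i + 1 < len(seq) else 0
            out.append(hi | lo)
        return bytes(out)

    print(len("ACGTN"), "bases ->", len(pack_seq("ACGTN")), "bytes")  # 5 bases -> 3 bytes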

Reply:

To add to this point, BAM/samtools and relational data/SQL have different motivations.

The point of relational data is to maintain relationship and type integrity among entities that are related to one another. It's a layer on top of the file system that allows concurrent, consistent, and quick access to relational records stored in an underlying file.

The point of BAM files is to conserve space and allow random access to a complex dataset. No relationships are enforced as strongly as in an RDBMS; there are just conventions that tools follow. Unlike an RDBMS, a BAM is not an additional layer on top of the file system but just a file following a specific format, and you don't need a special system like SQLite or Oracle to read it.

Answer • 3.7 years ago

Traditional databases are not only too slow for BAM files (~8.2B reads per flowcell), they are too damn slow for VCFs (~3.3M variants per individual). You might as well ask why databases are not used for audio/video, neuroimaging, GIS rasters, or other low-dimensional, high-volume data. There may be some isolated instances where database I/O could be competitive with flat files, but they never involve the database being used for complex queries.

Reply:

Performance issues aside, VCF is at least a naturally relational datatype. That's why it's such a pain to parse: it's really two tables pretending to be one. Really, you want one table for positions and information about positions, and a separate table for information on genotypes.
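As a rough sketch of that split (table and column names are invented, and INFO/FORMAT handling is omitted), the normalized layout would be something like:

    import sqlite3

    con = sqlite3.connect("variants.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS site (
            site_id INTEGER PRIMARY KEY,
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, qual REAL, filter TEXT
        );
        CREATE TABLE IF NOT EXISTS genotype (
            site_id INTEGER REFERENCES site(site_id),
            sample_id TEXT, gt TEXT, depth INTEGER
        );
        CREATE INDEX IF NOT EXISTS site_pos ON site (chrom, pos);
    """)

    # One VCF data line becomes one site row plus one genotype row per sample.
    cur = con.execute(
        "INSERT INTO site (chrom, pos, ref, alt, qual, filter) VALUES (?, ?, ?, ?, ?, ?)",
        ("chr1", 12345, "A", "G", 50.0, "PASS"),
    )
    con.execute(
        "INSERT INTO genotype (site_id, sample_id, gt, depth) VALUES (?, ?, ?, ?)",
        (cur.lastrowid, "NA12878", "0/1", 32),
    )
    con.commit()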

Reply:

There is also a header, which contains unique information of its own.

Reply:

I think the confusion comes both from the widespread abuse of RDBMSs (using them to store structured, record-based, not-really-relational data) and from the nature of basic bioinformatics datasets, which are structured records until someone starts using the room for flexibility that's built into them. (Example: VCF is pretty structured until you run VEP, snpEff, or ANNOVAR on it.)

An RDBMS assumes the data definitely has a well-defined structure and, in all probability, relationships, whereas our formats try to impose as much structure on the datasets as we can.

Reply:

I think a variant database or variant warehouse is a far more sustainable analysis solution than a VCF for every person on the planet. Unfortunately, the generic way relational databases are designed makes them too slow for the kinds of complex queries any analyst would want to perform. Multiple JOINs in particular are still very troublesome in 2020, even with careful indexing. Hail and GenomicsDB/TileDB basically mimic the interface we would expect from SQL, but with a distributed backend.

Reply:

What kind of JOINs are you thinking of?

For VCF-type data, can't you usually index by position (meaning they are naturally well-organized)?

Reply:

"from these 100,000 trios find me genomic intervals in which an autism affected proband has inherited one non-synonymous mutation in a neuro-related gene and an upstream non-coding variant from each unaffected parent"

Reply:

I think this is precisely the kind of question Hail was built to answer. We worked on ~1000 ASD-related samples, and plink was a pain in the behind with such mammoth VCFs; --load-vcf would take forever.

Reply:

That's a great example. But does the VCF format in itself make this query easier than a relational database?

Reply:

VCF is a transfer format; it's not meant for querying. There are already good database schemas for storing variants, but the backends themselves are just too slow. A query like the one above would probably take hours.
