As mentioned in other answers, BAM fulfills only part of your requirement: it gives you compression and random access, but not indexing. However, you can easily roll your own index using the BGZF API and a key-value store of your choice, e.g. Berkeley DB.
Here's an example using my cl-sam API, but you could substitute the C API that comes with Samtools or the Java API from Picard (or a SWIG wrapper). I'm just writing the data to a text file instead of BDB, but you get the idea:
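(defun bamdex (bam-file index-file)
  (with-open-file (index index-file :direction :output)
    (with-bgzf (bgzf bam-file :direction :input)
      ;; Consume the BAM header first so that the first READ-ALIGNMENT
      ;; starts at a record (assumes cl-sam's READ-BAM-META).
      (read-bam-meta bgzf)
      (loop
         for offset = (bgzf-tell bgzf) ; virtual offset of the next record
         for record = (read-alignment bgzf)
         while record ; READ-ALIGNMENT is assumed to return NIL at end of file
         do (format index "~s ~d~%" (read-name record) offset)))))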
Keys and values:
The key is the read name; the number is the record's BGZF virtual offset, which addresses a position in the uncompressed data (see the SAM spec). Pass it to a BGZF seek to reach the record:
(with-bgzf (bgzf "test.bam")
  (bgzf-seek bgzf 2269928771)
  (read-name (read-alignment bgzf)))
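For anyone wondering what a number like 2269928771 encodes: per the BGZF section of the SAM spec, a virtual offset packs the file offset of the start of a compressed block into the upper 48 bits, and the position within that block's uncompressed data into the lower 16 bits. Here is a minimal sketch of the arithmetic, written in Python only because it is easy to try out. The function names are my own and not part of any library.

```python
def split_virtual_offset(voffset):
    """Split a BGZF virtual offset into (block_start, within_block).

    block_start  : byte offset of the BGZF block in the compressed file
    within_block : byte offset inside that block's uncompressed data
    """
    return voffset >> 16, voffset & 0xFFFF


def make_virtual_offset(block_start, within_block):
    """Pack the two parts back into a single virtual offset."""
    if not 0 <= within_block < 65536:
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


# The offset used in the seek example above.
block_start, within_block = split_virtual_offset(2269928771)
print(block_start, within_block)  # prints: 34636 23875
```

So the seek above jumps to the block starting at byte 34636 of the compressed file, inflates it, and skips 23875 bytes into the result. This is also why the offsets cannot simply be compared to plain file sizes.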
If the reads are very long, you might consider using a sequence checksum as a key instead of the actual sequence.
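To make the key-value part concrete, here is a rough sketch of loading the text index into a store and looking records up. It is in Python and uses the standard-library sqlite3 module as a stand-in for Berkeley DB, so it is not what the Lisp code above does. The table layout, the function names and the sample lines are all hypothetical. It assumes each index line looks like `"name" offset`, as written by the `~s ~d` format above, and that names contain no embedded quotes. Since paired reads usually share a name, a lookup returns a list of offsets. The `sequence_key` helper shows one possible way to derive a fixed-length checksum key, here SHA-1, which is my choice and not a recommendation from any spec.

```python
import hashlib
import sqlite3


def sequence_key(sequence):
    """Fixed-length key for a (possibly very long) sequence: SHA-1 hex digest."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def load_index(lines, db_path=":memory:"):
    """Load '"key" offset' lines into a SQLite table and index the key column."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS read_index (key TEXT, voffset INTEGER)")
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, voffset = line.rsplit(" ", 1)  # the offset is the last field
        rows.append((key.strip('"'), int(voffset)))
    conn.executemany("INSERT INTO read_index VALUES (?, ?)", rows)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS read_index_key ON read_index (key)")
    conn.commit()
    return conn


def lookup(conn, key):
    """Return all virtual offsets stored under key, in ascending order."""
    cursor = conn.execute(
        "SELECT voffset FROM read_index WHERE key = ? ORDER BY voffset", (key,))
    return [row[0] for row in cursor]


# Hypothetical index lines; only the first offset appears in the answer above.
sample = ['"read_1" 2269928771', '"read_1" 2269929000', '"read_2" 98304']
conn = load_index(sample)
print(lookup(conn, "read_1"))  # prints: [2269928771, 2269929000]
```

Each offset that comes back can be handed to a BGZF seek as shown earlier. To key on sequence checksums instead, you would write `sequence_key(...)` of each record's sequence into the first column when building the index.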
Good info. I know samtools has a "TO DO" list, but I wonder if anyone has already done an implementation of an indexed BAM file...
Might be added to Biopython shortly, see here