Question

Forum: Tried building a compact sequence format with 4-bit storage

2

9 weeks ago

Pranava ▴ 30

Hello everyone,

This is my first time posting here. My name is Pranava, and I wanted to share a small project I have been working on.

The idea is to explore whether DNA sequences can be stored in a more compact and efficient way. I started building a prototype that uses 4-bit storage for bases, which cuts down on space compared to traditional FASTA. On top of that, the format is designed to allow direct access to specific sequences without having to scan the entire file.
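To make the 4-bit idea concrete, here is a minimal sketch of nibble packing, two bases per byte. The code table (one bit per base, so IUPAC ambiguity codes are bitwise ORs) and the names `pack` and `unpack` are assumptions chosen for illustration; they are not necessarily what the repository uses.

```python
# Hypothetical 4-bit code table: A=1, C=2, G=4, T=8, so each IUPAC
# ambiguity code is the bitwise OR of the bases it covers (N = 15).
# The string index is the 4-bit code. This layout is an assumption.
IUPAC = "-ACMGRSVTWYHKDBN"
ENCODE = {base: code for code, base in enumerate(IUPAC)}


def pack(seq: str) -> bytes:
    """Pack a sequence into 4-bit codes, two bases per byte (high nibble first)."""
    codes = [ENCODE[b] for b in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad the last byte; the true length must be stored elsewhere
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack(data: bytes, start: int, end: int) -> str:
    """Decode bases [start, end), touching only the bytes that hold them."""
    out = []
    for i in range(start, end):
        byte = data[i // 2]
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        out.append(IUPAC[code])
    return "".join(out)


packed = pack("ACGTNRYACGT")
print(len(packed))           # 6 bytes for 11 bases
print(unpack(packed, 4, 7))  # NRY
```

Because every base sits at a fixed position (base `i` lives in byte `i // 2`), a sub-range can be decoded without reading anything before it.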

What I hope to achieve is a format that could:

- make working with very large sequence collections faster and less memory-intensive,
- provide a way to stream sequences in batches for machine learning workflows, and
- serve as a more responsive backend for genome browsers and similar tools.
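As a rough sketch of how direct access and batch streaming could fit together, the example below keeps an index of byte offsets next to the stored records. The function names, the in-memory `io.BytesIO` container, and the index layout are all hypothetical; the payloads stand in for packed sequence bytes.

```python
import io
from itertools import islice


def write_records(handle, records):
    """Append each payload to handle and return {name: (offset, length)}."""
    index = {}
    for name, payload in records:
        index[name] = (handle.tell(), len(payload))
        handle.write(payload)
    return index


def fetch(handle, index, name):
    """Seek straight to one record instead of scanning the whole file."""
    offset, length = index[name]
    handle.seek(offset)
    return handle.read(length)


def batches(handle, index, batch_size):
    """Yield lists of (name, payload) pairs, batch_size records at a time."""
    names = iter(index)
    while True:
        chunk = list(islice(names, batch_size))
        if not chunk:
            return
        yield [(name, fetch(handle, index, name)) for name in chunk]


buf = io.BytesIO()
index = write_records(
    buf, [("chr1", b"\x12\x48"), ("chr2", b"\x84"), ("chr3", b"\x21\x21")]
)
print(fetch(buf, index, "chr2"))
for batch in batches(buf, index, 2):
    print([name for name, _ in batch])
```

With a real file the same pattern would apply to a handle opened in binary mode, with the index stored in a header or a sidecar file.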

I know there are already established formats like BAM, CRAM, and UCSC 2bit, but I wanted to try building one myself to learn from the process and see if there are new angles worth exploring.

The project is still at an early stage, but here is the repository: https://github.com/Bit-2310/compact-on-demand-rapid-encoding-of-sequences

I would be very interested to hear your thoughts, feedback, or suggestions on whether something like this could be useful in practice, and what features you would consider important.

Thank you, Pranava

fasta open-source file-formats data-storage bioinformatics • 11k views

updated 8 weeks ago by Alex Reynolds 36k • written 9 weeks ago by Pranava ▴ 30

1

Using four bits for storage won't beat 2bit, of course, but direct access to a sequence of interest sounds interesting and useful and might well offset the cost of added storage. However, searching a 2bit file for a sequence of interest is a so-called "embarrassingly parallel" problem, e.g. split up the search by chromosome, and one that can be made even faster by keeping each array in memory. I'd look at benchmarking your approach against that kind of setup.
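A baseline along these lines could look like the sketch below: each chromosome is held in memory and searched independently, and the hits are merged at the end. The names `find_all` and `parallel_search` and the toy genome are made up for illustration. Threads are used here only to keep the example short; in CPython, a process pool would be the usual choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def find_all(name, seq, query):
    """Return (chromosome, position) for every occurrence of query in seq."""
    hits = []
    pos = seq.find(query)
    while pos != -1:
        hits.append((name, pos))
        pos = seq.find(query, pos + 1)  # step by one to allow overlapping hits
    return hits


def parallel_search(genome, query, workers=4):
    """Search each in-memory chromosome as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(find_all, name, seq, query) for name, seq in genome.items()
        ]
        # Collect in submission order so the output is deterministic.
        return [hit for future in futures for hit in future.result()]


genome = {"chr1": "ACGTACGT", "chr2": "TTTT", "chr3": "GGACGG"}
print(parallel_search(genome, "ACG"))
# [('chr1', 0), ('chr1', 4), ('chr3', 2)]
```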

Reply • 9 weeks ago by Alex Reynolds 36k

0

I see. The reason I was planning to use a 4-bit system is to avoid losing too much information, but thank you for the suggestion; I will need to look into this more.

I recently ran my first benchmark. In terms of file size, the format did reasonably well compared to plain .fasta, but it did not hold up well against the compressed formats.
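For anyone who wants to reproduce this kind of size comparison, a minimal sketch follows. It uses a synthetic random sequence, which is an assumption made only to keep the example self-contained: random data is close to a worst case for general-purpose compressors, and real genomes with repeats will behave differently. The packed sizes are computed from the sequence length and ignore headers and indexes.

```python
import gzip
import math
import random


def size_report(seq: str, line_width: int = 60) -> dict:
    """Compare byte counts for one sequence under several encodings."""
    lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    fasta = (">seq1\n" + "\n".join(lines) + "\n").encode("ascii")
    return {
        "FASTA text": len(fasta),
        "4-bit packed": math.ceil(len(seq) / 2),  # two bases per byte
        "2-bit packed": math.ceil(len(seq) / 4),  # four bases per byte, ACGT only
        "gzip(FASTA)": len(gzip.compress(fasta)),
    }


random.seed(0)  # fixed seed so repeated runs use the same synthetic sequence
seq = "".join(random.choice("ACGT") for _ in range(100_000))
sizes = size_report(seq)
for label, n_bytes in sizes.items():
    print(f"{label:>14}: {n_bytes:>8} bytes")
```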

Reply • 9 weeks ago by Pranava ▴ 30

0

In addition to looking at other tools as a baseline (https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data), maybe also look into which functionality or queries your format supports that existing formats do not, or where your format would be faster or use less memory. You might compare your tool against existing software used for sequence alignment, for instance.

Reply • 8 weeks ago by Alex Reynolds 36k

score 2 · Answer 1 · 2025-09-19

Hello Pranava,

Thanks for sharing your repository. Devising a new file format is an interesting undertaking and certainly a task with many devils in the details.

A few years ago, I dreamed of writing an efficient converter to simplify the adoption of some of the existing compression formats, but ultimately this went nowhere. In the process, though, I looked into existing formats and approaches, so I can share a few links for stuff that was already around back then:

You would probably need to benchmark your format against other formats such as SRA, NAF, and SFQ. Relevant publications include those on LFastqC and on using FPGA-accelerated LZMA for compression.

For working with genomic data in Python, BioNumpy is neat, because it encodes genomic data as numeric arrays and processes them quickly with NumPy. More recently, a similar library that builds on Polars was released. Both have efficient internal representations of genomic data, although neither aims to create a new file format. They can certainly serve as inspiration, also with regard to the API.

Good luck and let us know when you think it is ready to be tested!