Hello everyone,
This is my first time posting here. My name is Pranava, and I wanted to share a small project I have been working on.
The idea is to explore whether DNA sequences can be stored in a more compact and efficient way. I started building a prototype that stores each base in 4 bits, which roughly halves the space compared to plain-text FASTA, where every base takes a full byte. On top of that, the format is designed to allow direct access to specific sequences without having to scan the entire file.
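To make the idea concrete, here is a minimal Python sketch of 4-bit packing with a name-to-offset index. It is only an illustration: the code table and the `pack`, `unpack`, `write_records`, and `fetch` helpers are made up for this post and do not reflect the actual layout used in the repository.

```python
# Hypothetical 4-bit codes, chosen for illustration only.
CODES = {"A": 0x1, "C": 0x2, "G": 0x4, "T": 0x8, "N": 0xF}
BASES = {code: base for base, code in CODES.items()}


def pack(seq):
    """Pack a DNA string into bytes, two bases per byte (high nibble first)."""
    out = bytearray((len(seq) + 1) // 2)
    for i, base in enumerate(seq.upper()):
        code = CODES[base]
        out[i // 2] |= (code << 4) if i % 2 == 0 else code
    return bytes(out)


def unpack(data, length):
    """Decode `length` bases from packed bytes, ignoring any padding nibble."""
    bases = []
    for i in range(length):
        byte = data[i // 2]
        nibble = (byte >> 4) if i % 2 == 0 else (byte & 0x0F)
        bases.append(BASES[nibble])
    return "".join(bases)


def write_records(records):
    """Return (blob, index); index maps name -> (byte offset, base count)."""
    blob = bytearray()
    index = {}
    for name, seq in records:
        index[name] = (len(blob), len(seq))
        blob += pack(seq)
    return bytes(blob), index


def fetch(blob, index, name):
    """Decode one sequence by slicing only its bytes out of the blob."""
    offset, length = index[name]
    nbytes = (length + 1) // 2
    return unpack(blob[offset:offset + nbytes], length)


blob, index = write_records([("seq1", "ACGTN"), ("seq2", "GGTCA")])
print(fetch(blob, index, "seq2"))
```

Because the index stores a byte offset and a length for each record, `fetch` never has to touch the bytes of any other sequence. On disk, the same lookup would be a seek followed by a short read.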
What I hope to achieve is a format that could:
- make working with very large sequence collections faster and less memory-intensive,
- provide a way to stream sequences in batches for machine learning workflows, and
- serve as a more responsive backend for genome browsers and similar tools.
I know there are already established formats like BAM, CRAM, and UCSC 2bit, but I wanted to try building one myself to learn from the process and see if there are new angles worth exploring.
The project is still at an early stage, but here is the repository:
https://github.com/Bit-2310/compact-on-demand-rapid-encoding-of-sequences
I would be very interested to hear your thoughts, feedback, or suggestions on whether something like this could be useful in practice, and what features you would consider important.
Thank you, Pranava
Using four bits for storage won't beat 2bit, of course, but direct access to a sequence of interest sounds interesting and useful and might well offset the cost of added storage. However, searching a 2bit file for a sequence of interest is a so-called "embarrassingly parallel" problem, e.g. split up the search by chromosome, and one that can be made even faster by keeping each array in memory. I'd look at benchmarking your approach against that kind of setup.
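As a rough sketch of the kind of baseline I mean, the Python below keeps each chromosome in memory as a plain string and searches them concurrently. The names `find_all` and `parallel_search` are made up for illustration, and decoding from an actual 2bit file is left out. A thread pool keeps the example short; for a real CPU-bound benchmark a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def find_all(name, seq, query):
    """Return (chromosome, position) for every occurrence of query in seq."""
    hits = []
    pos = seq.find(query)
    while pos != -1:
        hits.append((name, pos))
        pos = seq.find(query, pos + 1)  # step by one to allow overlapping hits
    return hits


def parallel_search(chromosomes, query, workers=4):
    """Search each in-memory chromosome as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(find_all, name, seq, query)
            for name, seq in chromosomes.items()
        ]
        return sorted(hit for future in futures for hit in future.result())


# Toy data standing in for decoded chromosomes.
chromosomes = {"chr1": "ACGTACGT", "chr2": "TTACGA"}
print(parallel_search(chromosomes, "ACG"))
```

Each chromosome is an independent unit of work, so no coordination is needed beyond merging the hit lists at the end.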