Hi there folks, I would like to share with you a little utility that I built just a couple of days ago. Its main purpose is to convert a bunch of illumina fastq files into a single sqlite3 database without pain. I am new in bioinformatics and not really sure whether this way of reads processing make sense for biologists, but this code actually helped my friend to build custom processing pipeline on top of sqlite3 database. So any feedback would be appreciated, thank you)
Questions would be summarized and answered right here.
Personally, I think SQL for raw data is kind of a bad idea - its much better suited for processed categorical data and/or highly relational data.
Having said that -- i'm going crazy right now with writing BAM files over and over and over.
Want to mark duplicates on a 100Gb BAM? You'll have to write another 100Gb BAM.
Want to realign indels on a 100Gb BAM? You'll have to write another 100Gb BAM.
And pretty much any other BAM operation that involves modifying the data. I think to follow the GATK best practices on 50 samples, i've written probably a Petabyte of data to disk so far. Obviously i've had to write over old data, because I don't have a petabyte hard drive. The point is if the reads where stored in a structured database, I could modify individual reads without having to copy all the other data too.
Just a thought, for maybe your next project :)
If you are considering doing pileup (wig/bed) data to SQL, I did that about two years ago and the results were pretty lackluster. Here each column is a sample, and each row is a base.
Of course, grabbing signal from a specific region over multiple files is really fast - about 2x faster for 10 columns than from 10 individual bigwigs, and the gains increase the more samples you grab data from at once.
But the size of the data is really significant. In the chart below, the c3 format is essentially a standard constant-step wig (stored in a binary format), and v2 is a variable-step binary format similar to BigWig but without the indexing. As you can see, SQL doesn't really offer a size improvement on constant-step files until you get way up into the 50s - and it will never be anywhere near as small as the variable-step formats.
So what you gain in speed, you loose 10x fold in size. You might be thinking "oh, well thats a fair trade off..?"
Well no, because if you load 50 v2 into memory, they fit just fine - and they can be accessed 1000x faster than SQL.
So the moral of the story is - SQL is great for compatibility, and great for when data is being constantly updated -- but truthfully its the tiny binary formats that win all the races. This is why, after all, FASTQ, BAM, BigWig, etc all exist in the first place.