Hi there folks, I would like to share with you a little utility that I built just a couple of days ago. Its main purpose is to convert a bunch of illumina fastq files into a single sqlite3 database without pain. I am new in bioinformatics and not really sure whether this way of reads processing make sense for biologists, but this code actually helped my friend to build custom processing pipeline on top of sqlite3 database. So any feedback would be appreciated, thank you)
Questions would be summarized and answered right here.
Personally, I think SQL for raw data is kind of a bad idea - its much better suited for processed categorical data and/or highly relational data.
Having said that -- i'm going crazy right now with writing BAM files over and over and over.
Want to mark duplicates on a 100Gb BAM? You'll have to write another 100Gb BAM.
Want to realign indels on a 100Gb BAM? You'll have to write another 100Gb BAM.
And pretty much any other BAM operation that involves modifying the data. I think to follow the GATK best practices on 50 samples, i've written probably a Petabyte of data to disk so far. Obviously i've had to write over old data, because I don't have a petabyte hard drive. The point is if the reads where stored in a structured database, I could modify individual reads without having to copy all the other data too.
Just a thought, for maybe your next project :)
If you are considering doing pileup (wig/bed) data to SQL, I did that about two years ago and the results were pretty lackluster. Here each column is a sample, and each row is a base.
Of course, grabbing signal from a specific region over multiple files is really fast - about 2x faster for 10 columns than from 10 individual bigwigs, and the gains increase the more samples you grab data from at once.
But the size of the data is really significant. In the chart below, the c3 format is essentially a standard constant-step wig (stored in a binary format), and v2 is a variable-step binary format similar to BigWig but without the indexing. As you can see, SQL doesn't really offer a size improvement on constant-step files until you get way up into the 50s - and it will never be anywhere near as small as the variable-step formats.
So what you gain in speed, you loose 10x fold in size. You might be thinking "oh, well thats a fair trade off..?"
Well no, because if you load 50 v2 into memory, they fit just fine - and they can be accessed 1000x faster than SQL.
So the moral of the story is - SQL is great for compatibility, and great for when data is being constantly updated -- but truthfully its the tiny binary formats that win all the races. This is why, after all, FASTQ, BAM, BigWig, etc all exist in the first place.
I don't mean to rain on your parade, but I see two problems:
So kudos to you for writing some bioinformatics code and releasing it publicly! Next time, maybe examine the problem space a bit more before spending too much time on code
Thank you Chris for your feedback, really appreciated. The main purpose of brew is to create backend for a custom filtering - processing, without any existing utilites. Brew yields data in easily accessible form, so you can write cleaner code to manipulate it. The closest analogy I can imagine is RAW format produced by hi-end cameras. This format contains signal from sensor as is in easy-to-manipulate form. About name collisions - you are right, I`ll fix it.
It seems that what your tool does is flattens three files into one like so:
so why not just put this info into a single flat file and have higher performance with both storing and querying it? What benefit does Sqlite offer over this?
Why would you even create three tables with foreing keys etc when all the information would fit into one table?
Heheh, I'm glad someone said it :P :)
Excellent question) Idea was to store lots of fastqs in the same db and have complex cross links between them. Also the idea of SQL VIEW is very promising - instead of having "hardcoded" tables you can have code driven virtual ones - so called VIEWs, it is very convenient way of interpretation and does not consume additional space.
How complex do you expect the connections among FASTQs generated for different experiments are going to be? Views are a nice concept, but what they save in terms of space they consume in terms of compute. Normalization (the RDBMS concept) applied to FASTQ is additional burden leading to very low gains - the RoI is not that good. I'm still trying to understand what need SQL-izing FASTQ could possibly serve.
I also agree with Chris on this that 'brew' is already taken in my brain - why not FASQL or something -_-;
> but I need "no server" solution. And also sqlite`s db can be send to another person or shared somehow.
you're looking for hdf5 https://en.wikipedia.org/wiki/Hierarchical_Data_Format
what kind of analysis requires fastqs in a sqlite3 db ?
Besides you dont need to think about this db as sql db. Connect to it from python with some ORM, like sqlalchemy and you will get object layer mapped to db data.
I've seen a lot of filtering involving BAM files, very few involving FASTQ. SQL is like a medical drug patent - the first pill costs millions of dollars, the second pill costs exponentially less. Or, the SQLite creation is expensive, the first query of any kind is expensive, but as indexes are built, subsequent queries get faster. With well-established FASTQ indexing algorithms, plus the fact that FASTQ is plain text while SQLite is not, I don't see the immediate benefit.
You had a whole lot of great questions and critics that I would like to summarize. So for me the main reason to use database is an attempt to avoid data mapping in application level, try to map forward reads to reverse reads to bar codes in your python, c++ or matlab code in memory. Sqlite in this case is an optimized storage and query engine, it should not know anything about the nature of your data (that is why all files represented in form of the same table), but it must be able to join/filter your data as fast as possible.
Why not single table?
Because you can have arbitrary amount of converted files in your db (from different experiments for example) and you dont even have to add all of them at the same time. Making a single table out off files is equivalent to having certain assumptions about the structure of data, however one of the key ideas was to avoid it. Data interpretation should be done in application level - SQL`s virtual tables known as VIEWS or in terms of object-relation mappers in your favorite programming language.
It is rather fast, it doesn't need server side, it accepts sql as query language, it can be manipulated via ORMs plus you can send or share database with your friend.
Why not flat file?
If you are using flat format - simple text file, you are extremely limited in processing capabilities - to find particular record you would need to lookup the whole file, sqlite uses b-trees and sophisticated caching techniques, so it is way faster.
Renamed to weld.
Why not use up2bit format to store sequence information?
There are two reasons -
What kind of task could be done in sqlite efficiently?
Any kind of filtering by predicate - for example demultiplexing.
Fastq is often gzip'd (several times smaller than your DB) and can be streamed efficiently (we rarely load a whole fastq into RAM). Your DB is mainly useful to pull a few reads by their names. For most other applications, it can't compete streaming on performance and simplicity. I would suggest you learn the current practice before defending your approach. People in this thread all know SQL/SQLite fairly well. We don't use it because it is often not as good as our current approach.
Ok, we appreciate that you are attempting to improve on a process and write a tool to handle FASTQ formats. Don't take the following (and the criticism in general) as discouragement.
It is just that you are solving a problem by first oversimplifying it - hence the many opinions and somewhat of a push-back. For example you say how sqlite is fast and has a query language - well it is fast when it operates on indexed columns, but you are not actually doing that. Your barcode lookup for example is slower than grepping through a file. You could of course add an index but then your single insert speeds would be terrifyingly slow. Now you'd need to solve it as a bulk insert problem. All doable and that would have value. What has less value is thinking that just by puttting some data into a database immediately solves a large suite of problems that people typically encounter.
Now that I looked at it your program it is more of a Illumina read metadata storage system. The tools seems to parse out and store a lot of information from the Illumina read naming scheme itself. That may have more value from some QC point of view perhaps and would be something that most tools don't actually do, hence could be built into something more valuable.
Ok, how to grep forward.fastq, reverse.fastq and barcode.fastq to actually obtain file, containing triplets like forward-reverse-barcode? Show me the code, Sir) or maybe awk or sed can do that somehow, please enlight me, do not hesitate. Performance is not THAT big deal here, just clone the code and try it, the build process is actually take a while but it worth it. The problem of bulk inserts is already solved, see the code. With the weld (now the utility has name weld, yes) you can parse any information declared by standard, but the resulting db is huge. Summary: db structure even without indexes is designed for high performance, at least way higher than grep can offer in tasks where you need not only find something but on top of it map things to each other.
I have not seen people put barcode in a separate file. For forward and reverse reads, they conventionally have the same number of lines: the i-th read in forward.fa is paired with the i-th read in reverse.fa. You just read the two files in parallel. You may also use paste like
paste forward.fa reverse.fa | grep -A3 ^@read_nameor create an interleaved fastq stream. If your forward.fa and reverse.fa are not properly paired, the upstream process is doing something wrong.
Mmmmm... the data I was given for some reason always represented as three separate files - forward, reverse and barcode. Essentially putting them together was the problem. Official documentation claims that output files should have following name pattern
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz. In my case I got 2 sequence files with 6 barcodes and barcodes themselves in separate file. Should I consult with my data provider?
You only get three files when the experimenter makes use of nonstandard barcodes. In the vast majority of situations when standard barcodes are used the instrument already separates the samples right away into files labeled by the samples (barcodes) and the barcode files themselves are not even distributed to the end user.
When the barcodes are more special and custom made (for example some people use longer or shorter or uneven length barcodes) at the end of the BCL conversion you will get three files - these all should be in the same order.
As Heng has mentioned already the files are always ordered the same way hence getting the Nth pair means simply iterating in parallel on all input files. Your example usage of getting triplets from each file can be done by chaining a few command line tools or when solved via Python in 10 lines of code if you want write it readably (probably a one liner in Perl ;-) )
This discussion though is enlightening in sort of making one think about what could be stored about sequencing reads if these were say longer like the MinION reads where there could be a lot more information about each read. FASTQ databases like this (of course going beyond just simple file storage) are not on our radar when the reads are short and come in immense quantities. Once the meta information in sequencing reads becomes richer and more valuable databases might be viable. I would suggest to look into that. In the case of MinION they even have a HFD format to start with. Amusingly each read is stored as a separate database! It would be nice to be able to query for say high quality reads, or reads over a certain length etc.