Why Don't We Use Binary Format?
8.3 years ago
sutkss ▴ 40

I am studying DNA compression algorithms because the amount of DNA data is expected to be enormous and to keep growing. The other day I got some DNA data in FASTA format; the file is about 2 GB. Then I realized that FASTA writes 'A', 'T', 'G', 'C' as character codes, so each base takes 8 bits. But the four bases can be expressed in 2 bits with a binary code, for instance A=00, T=01, G=10, C=11. Using a binary format would reduce this redundancy and let the data be stored in a much smaller size. I think using a binary format is a better way than using character codes.

Is there a special reason to store sequences using character codes?
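The 2-bit encoding proposed in the question can be sketched in a few lines of Python. This is only an illustration of the idea, using the question's own mapping (A=00, T=01, G=10, C=11); the function names `pack` and `unpack` are made up for this example, and the sequence length has to be stored somewhere else because the last byte may be padded.

```python
# Mapping taken from the question: A=00, T=01, G=10, C=11.
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}
BASE = {value: base for base, value in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack an A/T/G/C string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can always read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases; padding bits in the last byte are dropped."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

With this sketch, `pack("ATGC")` is the single byte `0x1B`, so a sequence of pure A/T/G/C shrinks to a quarter of its one-byte-per-base size. Any other character (such as 'N') raises a `KeyError`, which is the limitation several replies point out.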

fasta • 6.0k views

So, from looking through the answers, your question is almost a rhetorical (and correct) argument for using binary formats. In fact, binary formats are used in bioinformatics and what you have described is very close to the 2bit format. There are certainly pros and cons, most of which have been described in the answers or the related questions I have linked.

Edit: just one more point against the simplified '2bit' format you describe: a format for storing DNA sequences should support 'N', not to mention other ambiguity codes, because most draft genomes today contain them. The 'real' 2bit format, however, can store this information plus additional masking information (http://www.its.caltech.edu/~alok/reviews/blatSpecs.html) via run-length-encoded metadata.
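The idea of keeping 'N' runs as run-length-encoded metadata next to the packed bases could look roughly like the sketch below. It is not the actual 2bit file layout; the function names and the choice of 'A' as a placeholder base are assumptions made for this example.

```python
def split_n_runs(seq: str, placeholder: str = "A"):
    """Return (sequence with N replaced by a placeholder, list of (start, length) N runs)."""
    runs = []
    i = 0
    n = len(seq)
    while i < n:
        if seq[i] == "N":
            start = i
            while i < n and seq[i] == "N":
                i += 1
            runs.append((start, i - start))
        else:
            i += 1
    # The placeholder only keeps positions aligned; its value carries no meaning.
    return seq.replace("N", placeholder), runs


def restore_n_runs(bases: str, runs) -> str:
    """Put the N runs back using the stored (start, length) pairs."""
    chars = list(bases)
    for start, length in runs:
        chars[start:start + length] = "N" * length
    return "".join(chars)
```

The placeholder-filled string contains only A/T/G/C and can therefore be packed at two bits per base, while a long gap of N's costs just one (start, length) pair.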

There are plenty of binary formats available. Any decent compression algorithm will generate a binary-formatted file. Most people use them mainly for archival purposes. Reference-based compression is all the rage these days.

For analysis, it's just simpler to use standard string formats. Being able to manipulate files with simple *nix commands is worth the file size.

I have always thought the same. One main problem, I guess, is ambiguity codes like W (A/T), which show up in consensus sequences and variant caller output. If we go to binary, we have to accommodate two (or three) different bases per position in some way.
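One way to accommodate codes like W in a binary encoding is to spend four bits per position, one bit per base, so that an ambiguity code is simply the union of the bases it stands for. The sketch below assumes that bit assignment (A=1, C=2, G=4, T=8) and packs two symbols per byte; the names `pack4` and `unpack4` are invented for this example.

```python
BIT = {"A": 1, "C": 2, "G": 4, "T": 8}

# IUPAC nucleotide codes written as the set of bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
MASK = {symbol: sum(BIT[b] for b in bases) for symbol, bases in IUPAC.items()}
SYMBOL = {mask: symbol for symbol, mask in MASK.items()}


def pack4(seq: str) -> bytes:
    """Pack IUPAC symbols into bytes, two 4-bit masks per byte."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = MASK[seq[i]]
        # Pad an odd-length sequence with an empty mask in the low nibble.
        low = MASK[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack4(data: bytes, length: int) -> str:
    """Recover the first `length` symbols from the packed nibbles."""
    masks = []
    for byte in data:
        masks.append(byte >> 4)
        masks.append(byte & 0x0F)
    return "".join(SYMBOL[m] for m in masks[:length])
```

Here W becomes `1 | 8 = 9` and N becomes 15. The cost is that the saving over plain text drops from a factor of four to a factor of two.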

8.3 years ago
matted 7.5k

Unless I misunderstand your point, I'd mostly disagree with your premise. Some binary formats are popular, though text formats are common for cases where performance or space efficiency isn't the primary concern and easy usability (and readability) is.

The 2bit format is a binary format popular for distributing large sets of sequences (see the description at UCSC).

Most large fasta files are distributed in a compressed binary format (e.g. gzip). For example, see the human chromosomes as fa.gz files at UCSC.
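A gzipped FASTA file does not have to be decompressed to disk before use. As a minimal sketch, Python's standard `gzip` module can open it in text mode and stream it line by line; the function name `count_bases` and the file name in the usage line are placeholders for this example.

```python
import gzip


def count_bases(path: str) -> int:
    """Count sequence characters in a gzipped FASTA file without unpacking it to disk."""
    total = 0
    # "rt" opens the compressed file as text, decompressing on the fly.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total
```

Usage would be something like `count_bases("chr1.fa.gz")`, where the path is whatever compressed file you downloaded.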

Another notable binary format: NCBI ASN.1 http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html

8.3 years ago

There are many binary formats used in bioinformatics.

• BAM is a binary format for alignments
• plink/BED is a binary format for SNP genotypes
• starch is a binary format for genome annotations
• GDS is a binary format, used for example by the SNPRelate CRAN package, which allows encoding 4 SNP genotypes per byte

Since my field of work is human genetics, I am a bit biased toward formats for SNP genotypes and alignments, but I am sure that there are many other formats for other fields.

I agree with you that these formats are not used as frequently as they should be, because many people prefer to use flat file formats. This is mostly because most bioinformaticians are not professional programmers, so they learn the flat formats first. However, there are many options for using binary formats, and in my view, the usage of these formats has increased in recent years.

Genome Differential Compressor (GDC) is also worth mentioning here: http://bioinformatics.oxfordjournals.org/content/27/21/2979.full

8.3 years ago

One reason is that binary formats require special tools to read and process, while textual formats are easy to run through a toolbox of common UNIX utilities, readily found on most any Linux or OS X box. Tools to process binary files can be platform-specific and are often not as flexible.

Also, a number of pipelines that manipulate binary data often turn it into some human-readable form, in the end.

For instance, while you might find FASTA in compressed form, no one really enjoys working directly with gzipped-bytes; the data are first extracted to text with gunzip or the like and then manipulated. Or binary BAM data are processed with samtools and results are written to textual output, etc.

8.3 years ago

I don't really agree. I think a lot of people just keep the .gz files and stream them directly to their favorite tools (zcat is our friend). Also, I find that the BAM format is more and more popular. And, more generally, there are a lot of initiatives today proposing lossless compression and quickly accessible binary formats, especially, but not only, for aligned sequences. Here are a couple of fresh ones:

• Bonfield, J. K., & Mahoney, M. V. (2013). Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE, 8(3), e59190. doi:10.1371/journal.pone.0059190
• Hach, F., Numanagic, I., Alkan, C., & Sahinalp, S. C. (2012). SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics (Oxford, England), 28(23), 3051–3057. doi:10.1093/bioinformatics/bts593
• Jones, D. C., Ruzzo, W. L., Peng, X., & Katze, M. G. (2012). Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research, –. doi:10.1093/nar/gks754
• Popitsch, N., & von Haeseler, A. (2013). NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research, 41(1), e27. doi:10.1093/nar/gks939

The 1000 Genomes Project is providing all our BAMs in lossy CRAM.

2.2 years ago

One factor I haven't seen mentioned for why we don't represent sequences using two bits per base is that it would leave a Hamming distance of only 1 bit between some bases, and since all binary values from 0 to 3 are valid in this scheme, a "bit flip" error would go completely undetected. This is actually one of the reasons why ASCII is 7 bits and not 8: it left the 8th bit free to serve as a parity bit, although it wasn't always used that way.

Memory errors like this aren't rare, either. Remember that memory isn't a "0 or 1", it's usually about +5V means 1 and about 0V means 0, but this system is far from perfect. Cosmic rays cause memory errors fairly frequently, and the effect becomes more pronounced at higher altitudes, as there is "less atmosphere" to shield you, if you will. Hardware itself can be unreliable, although it has come a long way. There are mitigating factors, like ECC memory and error-detection techniques, but generally speaking you don't want to just roll the dice like you would be with this system, since you wouldn't be able to tell if the sequence had been modified unless you had something to compare it to. And even then, what if the two sequences are different? How do you know which one is right? Do you get another copy and say best two out of three? Even if that worked, it seems to go against the spirit of optimization in which this question was posed.
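To make the undetected bit flip concrete, here is a small sketch using the question's mapping (A=00, T=01, G=10, C=11). It also shows how a single parity bit, stored separately, would at least reveal that something changed. The helper names and the idea of keeping one parity bit per packed byte are assumptions for illustration, not part of any format discussed here.

```python
BASES = "ATGC"  # index in this string = 2-bit code from the question


def decode(byte: int) -> str:
    """Decode one packed byte into four bases, highest bits first."""
    return "".join(BASES[(byte >> shift) & 0b11] for shift in (6, 4, 2, 0))


def parity(byte: int) -> int:
    """Return 1 if the byte has an odd number of set bits, else 0."""
    return bin(byte).count("1") % 2


def flip_bit(byte: int, position: int) -> int:
    """Simulate a single-bit memory error at the given bit position."""
    return byte ^ (1 << position)


packed = 0b00011011              # encodes ATGC
stored_parity = parity(packed)   # hypothetical extra bit kept alongside the data

damaged = flip_bit(packed, 0)    # lowest bit flips: the C becomes a G
print(decode(packed), decode(damaged))    # both are valid sequences
print(parity(damaged) != stored_parity)   # the parity check notices the change
```

The damaged byte still decodes to a perfectly valid sequence, so nothing in the data itself signals the error. A parity bit catches any single flip but cannot say which bit changed, and it misses two flips in the same byte.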

Even the 2bit format isn't actually just 2 bits per base; there's a fair amount of metadata in the file. With all that being said, I agree with everyone who has mentioned that manipulating text data is the most convenient. It's the whole reason for base64 encoding, and part of the reason for XML and JSON.