I suspect that the greater entropy or "randomness" of most genomic sequence, along with a Huffman table that often needs more bits per character than the two bits 2bit spends per base, is why gzip could perform worse than 2bit.
To explain, the 2bit format packs each base into two bits (one of four codes for T, C, A and G) and records runs of N's and soft-masked (lowercase, typically repeat) regions as coordinate block lists in each sequence's record. That is in addition to some minor overhead for metadata for the file and for each sequence record.
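For concreteness, here is a minimal Python sketch of the packing step only, assuming the UCSC convention of T=0, C=1, A=2, G=3 and leaving out the real format's headers and block lists:

```python
# Minimal sketch of 2bit-style packing: two bits per base, four bases per byte.
# Assumes UCSC's T=0, C=1, A=2, G=3 codes; the real format also records N runs
# and soft-masked (lowercase) regions as coordinate block lists, omitted here.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}

def pack_2bit(seq: str) -> bytes:
    seq = seq.upper().replace("N", "T")  # a placeholder base is stored under N blocks
    packed = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(group))    # pad a trailing partial group to a full byte
        packed.append(byte)
    return bytes(packed)

print(len(pack_2bit("ACGT" * 1000)))     # 4000 bases -> 1000 bytes
```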
The gzip DEFLATE procedure works in two stages: LZ77, which looks for repeated strings and replaces them with back-references, followed by Huffman coding, which builds a variable-bit-length code table for the literals and references that LZ77 emits.
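If you want to see those two stages separately, Python's zlib module (which wraps the same DEFLATE library gzip uses) lets you switch off the LZ77 string matching with the Z_HUFFMAN_ONLY strategy; a rough sketch:

```python
import zlib

def deflate_size(data: bytes, strategy: int) -> int:
    # Raw DEFLATE at maximum effort, returning the compressed size in bytes.
    co = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, strategy)
    return len(co.compress(data) + co.flush())

toy = b"GGAATTCC" * 100_000  # deliberately repetitive input

print(deflate_size(toy, zlib.Z_HUFFMAN_ONLY))      # Huffman coding only, no LZ77 matching
print(deflate_size(toy, zlib.Z_DEFAULT_STRATEGY))  # full LZ77 + Huffman pipeline
```

On input like this, the Huffman-only pass can't do much better than about two bits per input byte, while the full pipeline collapses the repeats to a small fraction of that.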
It seems reasonable to believe that 2bit would be more efficient on genomic regions containing "noisy" or "random" sequence data, i.e., a good portion of a given genome, I'd wager. With 2bit, every base in a sequence gets packed into the same two bits, regardless of the overall information content of the sequence being squeezed.
Some special areas of the genome are redundant - transcription factor motifs, repeats like satellite DNA, etc. - and there LZ77 performs better, squeezing those redundancies down to a greater degree.
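If you want to see that gap for yourself, here is a hedged sketch using Python's zlib (the same DEFLATE engine gzip uses), comparing a "noisy" sequence and a repetitive one against 2bit's flat two bits per base:

```python
import random
import zlib

random.seed(1)
n = 1_000_000

noisy = "".join(random.choices("ACGT", k=n))   # "random" sequence, near-maximal entropy
repeats = ("TTAGGG" * (n // 6 + 1))[:n]        # telomere-style repeat, very low entropy

for label, seq in (("noisy", noisy), ("repetitive", repeats)):
    bits_per_base = 8 * len(zlib.compress(seq.encode(), 9)) / n
    print(f"{label:>10}: gzip ~{bits_per_base:.2f} bits/base vs 2bit's flat 2.00")
```

As I understand it, the noisy case should land around or a bit above two bits per base, while the repeat drops to a tiny fraction of that.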
Once the redundant stretches are collapsed by LZ77, the output is reduced further with a Huffman code table, whose entries I think will often be longer than two bits, because the alphabet holds at least ten sequence characters (actgACTGnN), the characters that appear in FASTA headers, and the length and distance codes that LZ77 uses to describe the redundancies.
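The tables DEFLATE builds are more involved than this (lengths and distances get their own codes), but a plain Huffman construction over a made-up tally of FASTA-like symbols illustrates the point; with a tally like this, only the one or two most common bases come out at two bits and everything else is longer:

```python
import heapq
from collections import Counter

# Made-up literal counts for a stretch of FASTA-like text: mostly upper-case bases,
# plus some soft-masked (lowercase) bases, N's, newlines and a header character.
counts = Counter({"A": 300, "C": 200, "G": 200, "T": 300,
                  "a": 40, "c": 30, "g": 30, "t": 40,
                  "N": 50, "n": 5, "\n": 20, ">": 1})

# Standard Huffman construction: repeatedly merge the two lightest subtrees and
# track the depth (code length in bits) assigned to each symbol.
heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    f1, _, d1 = heapq.heappop(heap)
    f2, _, d2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, tiebreak,
                          {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}))
    tiebreak += 1

for sym, bits in sorted(heap[0][2].items(), key=lambda kv: kv[1]):
    print(repr(sym), bits, "bits")
```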
So, in sum, 2bit always spends two bits per base (plus minor overhead), whether the sequence data is repetitive or not. Since there is probably less redundancy in most genomic data than gzip needs to beat that flat rate, the gzip procedure likely performs worse on typical, average inputs.
To answer your last question, it is difficult to compress a file with a series of two or more modern algorithms and expect better results. As you compress data, you strip out exactly the redundancy that any subsequent algorithm would need to exploit; that's the definition of compression, after all!
Otherwise, you could just apply a magical compression algorithm to the same data over and over, in series, and shrink the original input exponentially. With real-world compression algorithms, at some point you can't take any more squished stuff out and still be able to get your old, un-squished stuff back.
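You can watch that limit kick in by feeding DEFLATE its own output; a small sketch with zlib:

```python
import zlib

data = b"ACGTACGTTTAGGGTTAGGG" * 500_000   # redundant input, so the first pass has lots to remove

pass1 = zlib.compress(data, 9)
pass2 = zlib.compress(pass1, 9)   # compressing the already-compressed output
pass3 = zlib.compress(pass2, 9)

print(len(data), len(pass1), len(pass2), len(pass3))
# The first pass strips out nearly all the redundancy; later passes hover around
# (or even slightly above) the previous size, because their input now looks random.
```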
What you could do, I think, is take a 2bit representation of what you know in advance is highly redundant DNA sequence data, and then squeeze that with a compression tool designed to take advantage of redundant input, like gzip or bzip2. You might be able to squeeze it down a bit more.
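Combining the two ideas, here is a rough sketch of that pipeline on a deliberately repetitive toy region (the bare-bones packer and the monomer below are purely illustrative, not the real 2bit layout):

```python
import zlib

CODES = {"T": 0, "C": 1, "A": 2, "G": 3}

def pack_2bit(seq: str) -> bytes:
    # Bare-bones two-bits-per-base packing (no headers, no N or mask blocks).
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]
        out.append(byte << (2 * (4 - len(group))))
    return bytes(out)

monomer = "ACGTTAGCCATGGATCGTTA"   # made-up 20-base repeat unit, purely illustrative
region = monomer * 50_000          # 1,000,000 bases of satellite-like repetition

packed = pack_2bit(region)
print(len(region), len(packed), len(zlib.compress(packed, 9)))
# Packing alone gives the flat 4x reduction; gzip then exploits the repeat structure
# that survives packing and squeezes the region down much further still.
```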
Why don't you try to gzip/bzip2 a 2bit file and tell us the resulting sizes :o)
Hmm, well I didn't have one on hand and didn't know where to get one, but then I remembered seeing something about 2bit files at UCSC. So I grabbed hg19.2bit, which is 779 MB. Not surprising, since I know it includes header data to deal with things like N's and other non-ATGC data. And the gzipped version is 676 MB, only 90% of the theoretical 753 MB that a flat 1/4-size FASTA file would occupy. So at least one thing works out like I expected.
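For what it's worth, a back-of-the-envelope check using only the sizes quoted above:

```python
# Back-of-the-envelope check using only the sizes quoted above (in MB):
two_bit    = 779  # hg19.2bit as downloaded
two_bit_gz = 676  # hg19.2bit after gzip
flat       = 753  # a hypothetical flat quarter-size FASTA, i.e. exactly 2 bits/base

bases = flat * 4  # implied genome size, in "MB of bases" (roughly 3.0 Gbases)
print(f"2bit alone:  {8 * two_bit / bases:.2f} bits/base")
print(f"2bit + gzip: {8 * two_bit_gz / bases:.2f} bits/base")
# The raw 2bit file sits just above 2 bits/base because of its headers and block
# lists; gzip claws some of that back, presumably helped by the placeholder bases
# packed under the long N blocks and by genuine repeats in the sequence itself.
```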
hg19 also contains many N's, plus the sequence headers (e.g. ">chr1"). I don't know what effect those have on compression efficiency.
From what I gather about DEFLATE, its code tables are built per block, so they can adapt to local abundances of different strings, somewhat like LZ78's adaptive dictionary. So the runs of millions of N's I've seen in hg19 should compress very efficiently, and once gzip exits those regions, I'd expect it to quickly rebuild a code table appropriate to "normal" sequence. Same for the stretches following header lines. This is what I'd expect, but as you can see, my expectations have not been correct, which is the perplexing part of this puzzle.
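The N-run part of that intuition, at least, is easy to check with zlib standing in for gzip:

```python
import random
import zlib

random.seed(0)
gap = b"N" * 5_000_000                                       # a long assembly gap
seq = "".join(random.choices("ACGT", k=5_000_000)).encode()  # "normal" sequence stand-in

for label, data in (("gap alone", gap), ("sequence alone", seq), ("gap + sequence", gap + seq)):
    print(f"{label:>15}: {len(data):>8} -> {len(zlib.compress(data, 9)):>8} bytes")
# The gap compresses away almost entirely, and the concatenation costs barely more
# than the sequence alone, consistent with DEFLATE adapting block by block.
```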