ALAPY Compressor (update: version 1.3 adds MacOS X support and improves speed and compression ratio by 10-20%)
High throughput lossless genetics data compression tool.
Compress a .fastq or .fastq.gz FILE to the .ac format, or decompress an .ac FILE to a fastq.gz file containing an exact copy of the original fastq file contents. By default, FILE is compressed or decompressed in place (the output is written next to the input) and the original file is left intact. The lossless ALAPY Compressor algorithm is used by default. You can specify an output directory (it must already exist, as the program will not create it) and change the output file name. By default, the program prints progress information to stdout, but this can be suppressed.
HOW TO GET THE TOOL
To get the latest version, please visit the http://alapy.com/services/alapy-compressor/ website, scroll down to the Download section, select your system by clicking on “for Windows” or “for Unix” (version for Mac OS is coming), read the EULA http://alapy.com/alapy-compressor-eula/ and tick the checkbox. The DOWNLOAD button will appear. Click on it to download the tool.
All versions of ALAPY Compressor are also available for free on GitHub at https://github.com/ALAPY/alapy_arc under the same EULA. The tool is free to use.
There are paid versions of the compressor with extended functionality and services. Please feel free to ask about them.
- Added MacOS X support (10.12 Sierra and above)
- Optimized compression on "medium" and "fast" levels (.ac files are now ~10-15% smaller than with version 1.2.0)
- Added "experimental" option for compression dictionary optimization (-a/--optimize_alphabet), which improves compression speed (up to 20%)
- Optimized error handling
- Added compression level option (-l / --level):
  - best: best compression ratio; the .ac file is 1.5-4 times smaller than with gzip, but compression is 4-6 times slower than gzip; requires 1500MB of memory
  - medium (default): medium level of compression (3-5% bigger .ac file than on best level); 1.2-1.8 times slower than gzip; requires 100MB of memory
  - fast: fastest compression, 0.5-1.4 of gzip speed; the .ac file is 4-10% bigger than on best level; requires 100MB of memory
- Added ability to output results of decompression to stdout (see help for the -d / --decompress option)
- Added ability to compress data from stdin (see help for the -c / --compress option)
- Changed input validation module for stdin/stdout support
- Improved synchronization of threads (compression speed increased on average by 15%)
- Changed data decompression module (decompression speed on average increased by 30%)
- Optimized intermediate data decompression to write directly to output file or stdout
- Fixed end of line characters handling
- Fixed comparison of reads’ headers and comments
- Initial public beta version
This tool is already compiled for Unix and Windows. Make it executable and put it in the PATH, or run it as is from its directory.
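On Unix-like systems, the "make it executable and put it in the PATH" step might look like the sketch below. The install_tool helper and the "$HOME/bin" directory are illustrative assumptions, not part of the ALAPY distribution; adjust the paths to wherever you saved the download.

```shell
#!/bin/sh
# install_tool BINARY DIR: copy BINARY into DIR and mark it executable.
# Hypothetical helper for illustration; BINARY and DIR are placeholders.
install_tool() {
    mkdir -p "$2" || return 1          # create the target directory if needed
    cp "$1" "$2/" || return 1          # copy the downloaded binary
    chmod +x "$2/$(basename "$1")"     # make the copy executable
}

# Example (not run here; assumes the binary was saved as ./alapy_arc):
# install_tool ./alapy_arc "$HOME/bin"
# PATH="$HOME/bin:$PATH"; export PATH   # applies to the current shell session only
# alapy_arc --version
```

To keep the PATH change across sessions, the export line would normally go into your shell's startup file.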
alapy_arc [OPTION] [FILE] [OPTION]...
The options for the program are as follows:
- -h --help: Print this help and exit.
- -v --version: Print the version of the program and exit.
- -c --compress: Compress your fastq or fastq.gz file to an ALAPY Compression format .ac file.
- -d --decompress: Decompress your .ac file to a fastq or fastq.gz file.
- -o --outdir: Create all output files in the specified output directory. Please note that this directory must exist, as the program will not create it. If this option is not set, the output file for each file is created in the same directory as the file which was processed. If the output file already exists in place, the name of the output file will be changed by adding the output file version.
- -n --name: Rename your file after processing.
- -q --quiet: Suppress all progress messages on stdout and only report errors.
alapy_arc --compress your_file.fastq.gz --outdir ~/alapy-archive --name renamed_compressed_file --quiet
This will compress your_file.fastq.gz to renamed_compressed_file.ac in the alapy-archive directory in your home folder, provided that the alapy-archive directory exists. If renamed_compressed_file.ac is already present there, the output will be written to the alapy-archive directory under a name with a version added.
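Because the program will not create the output directory itself, a small wrapper can create it first. The following is a minimal sketch: compress_to is a hypothetical helper, and the ALAPY_ARC variable is an assumption introduced here only so the command name can be overridden (for example, for a dry run).

```shell
#!/bin/sh
# compress_to FILE OUTDIR: make sure OUTDIR exists, then compress FILE into it.
# Hypothetical wrapper; ALAPY_ARC is an assumed override, defaulting to alapy_arc.
compress_to() {
    mkdir -p "$2" || return 1
    "${ALAPY_ARC:-alapy_arc}" --compress "$1" --outdir "$2" --quiet
}

# Example (not run here; assumes alapy_arc is in the PATH):
# compress_to your_file.fastq.gz "$HOME/alapy-archive"
```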
alapy_arc -d your_file.ac
This will decompress your_file.ac (ALAPY Compressor format) into your_file.fastq.gz in the same folder. If a file named your_file.fastq.gz already exists, a version will be added to the output file name.
alapy_arc_1.1 -d your_file.fastq.ac - | fastqc /dev/stdin
bwa mem reference.fasta <(alapy_arc_1.1 -d your_file_R1.fastq.ac - ) <(alapy_arc_1.1 -d your_file_R2.fastq.ac - ) > your_bwa_mem.SAM
These are general examples of piping. Note that they are not POSIX: process substitution <(...) is implemented in bash, not in sh. Some programs support reading from stdin natively. Read their help and/or manuals. For example, FastQC supports it this way:
alapy_arc_1.1 -d your_file.fastq.ac - | fastqc stdin
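Where process substitution is not available (plain sh), a named pipe can play the same role for tools that only accept a file path. The via_fifo helper below is a hypothetical sketch, not something shipped with the compressor; it relies on mktemp -d, which is widely available but not POSIX.

```shell
#!/bin/sh
# via_fifo 'PRODUCER COMMAND' CONSUMER [ARGS...]
# Runs the producer in the background writing to a named pipe (FIFO) and
# passes the FIFO path to the consumer as its last argument.
via_fifo() {
    producer=$1; shift
    dir=$(mktemp -d) || return 1
    fifo=$dir/reads.fastq
    mkfifo "$fifo" || { rm -rf "$dir"; return 1; }
    sh -c "$producer" > "$fifo" &      # blocks until the consumer opens the FIFO
    pid=$!
    "$@" "$fifo"                       # consumer reads the FIFO like a regular file
    status=$?
    # If the consumer failed, the producer may still be blocked; stop it.
    [ "$status" -eq 0 ] || kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
    rm -rf "$dir"
    return "$status"
}

# Example (not run here; assumes alapy_arc_1.1 and bwa are installed):
# via_fifo 'alapy_arc_1.1 -d your_file.fastq.ac -' bwa mem reference.fasta > your_bwa_mem.sam
```

The FIFO gets a .fastq name because some tools guess the input format from the file extension; whether a given tool needs this is something to check in its manual.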
You may find more about ALAPY Compressor usage on our website http://alapy.com/faq/ (select ALAPY Compressor as a relevant topic).
PIPE-ability stdin/stdout testing
With stdin/stdout support, you can use fastq.ac files directly in your pipes, so there is no need to write fastq or fastq.gz files to your hard drive. You can start from fastq.ac, run FastQC, then Trimmomatic, Trimgalore or CutAdapt, double-check with FastQC, and then use BWA or Bowtie2, all in a pipe. This is what we have tested rigorously. Some tools accept stdin or - as a parameter; named pipes, process substitution and /dev/stdin are the other ways to use fastq.ac in your pipes. Here is the testing summary, where a + sign shows support as tested and a - sign shows no high-quality support:
| tool | subcommand | command line | stdin | /dev/stdin | - (as stdin) | <(…) process substitution | comment |
|---|---|---|---|---|---|---|---|
| fastqc 0.11.5 | | alapy_arc_1.1 -d test.fastq.ac - \| fastqc stdin | + | + | - | + | recommended by authors |
| fastqc 0.11.5 | | alapy_arc_1.1 -d test.fastq.ac - \| fastqc /dev/stdin | + | + | - | + | |
| bwa 0.7.12-5 | mem | alapy_arc_1.1 -d test.fastq.ac - \| bwa mem hg19.fasta /dev/stdin > aln_se.sam | - | + | + | + | |
| bwa 0.7.12-5 | mem | alapy_arc_1.1 -d test.fastq.ac - \| bwa mem hg19.fasta - > aln_se_1.sam | - | + | + | + | |
| bwa 0.7.12-5 | mem (PE reads) | bwa mem hg19.fasta <(alapy_arc_1.1 -d test_R1.fastq.ac -) <(alapy_arc_1.1 -d test_R2.fastq.ac -) > aln-pe2.sam | - | - | - | + | paired end |
| bwa 0.7.12-5 | aln | alapy_arc_1.1 -d test.fastq.ac - \| bwa aln hg19.fasta /dev/stdin > aln_sa.sai | - | + | + | + | |
| bwa 0.7.12-5 | samse | alapy_arc_1.1 -d test.fastq.ac - \| bwa samse hg19.fasta aln_sa.sai /dev/stdin > aln-se.sam | - | + | + | + | |
| bwa 0.7.12-5 | bwasw | alapy_arc_1.1 -d SRR1769225.fastq.ac - \| bwa bwasw hg19.fasta /dev/stdin > bwasw-se.sam | - | + | + | + | long reads testing |
| bowtie 1.1.2 | | alapy_arc_1.1 -d test.fastq.ac - \| bowtie hs_grch37 /dev/stdin | - | + | + | + | |
| bowtie2 2.2.6-2 | | alapy_arc_1.1 -d SRR1769225.fastq.ac - \| bowtie2 -x hs_grch37 -U /dev/stdin -S output.sam | - | + | + | + | |
| bowtie2 2.2.6-2 | (PE reads) | bowtie2 -x ./hs_grch37 -1 <(alapy_arc_1.1 -d ERR1585276_1.fastq.ac -) -2 <(alapy_arc_1.1 -d ERR1585276_2.fastq.ac -) -S out_pe.sam | - | - | - | + | paired end |
| trimmomatic 0.35+dfsg-1 | | alapy_arc_1.1 -d test.fastq.ac - \| java -jar trimmomatic.jar SE -phred33 /dev/stdin trimmomatic_out.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 | - | + | - | + | |
| cutadapt 1.9.1 | | alapy_arc_1.1 -d test.fastq.ac - \| cutadapt -a AACCGGTT -o cutadapt_output.fastq - | - | - | + | + | |
| trimgalore 0.4.4 | | alapy_arc_1.1 -d test.fastq.ac - \| trim_galore -a AACCGGTT -o trimgalore_output /dev/stdin | - | + | + | + | |
| bbmap 37.23 | | alapy_arc_1.1 -d test.fastq.ac - \| ./bbmap.sh ref=hg19.fasta in=stdin out=mapped.sam usemodulo=t | + | - | - | + | |
We tested ALAPY Compressor on a diverse set of 230 public fastq files from NCBI SRA. You can read more about it on our website http://alapy.com/services/alapy-compressor/
COMPRESSOR TESTING ON THE BENCHMARK
We observed a 1.5 to 3 times better compression ratio compared to the gzipped fastq file (fastq.gz) for the current version of the algorithm. On this figure, you can find results of compression for several representative NGS experiments, including WES, WGS, RNA-seq, ChIP-seq and BS-seq, using different HiSeqs, NextSeqs, Ion Torrent Proton, AB SOLiD 4 System and Helicos Heliscope, on human and mouse (both on the picture) as well as Arabidopsis, Zebrafish, Medicago, Yeasts and many other model organisms.
Our tool was used on more than 2000 different fastq files, and in all cases the md5 sum of the fastq file contents was exactly the same before compression and after decompression. We saved several TBs of space for more science. Hurray!
We are working on improving our algorithm, on support for other file formats, and on a version that makes small changes to the data in exchange for a dramatic increase in compression ratio.
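A round-trip check of this kind can be sketched as follows. The helper names content_md5 and same_content are hypothetical, and md5sum is assumed to be available (GNU coreutils). The checksum is taken over the uncompressed FASTQ content, since two .gz files with identical content may still differ byte for byte.

```shell
#!/bin/sh
# content_md5 FILE: print the MD5 of the uncompressed FASTQ content
# of a .fastq or .fastq.gz file.
content_md5() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | md5sum | cut -d ' ' -f 1
}

# same_content A B: succeed if both files hold identical FASTQ content.
same_content() {
    [ "$(content_md5 "$1")" = "$(content_md5 "$2")" ]
}

# Example round trip (not run here; assumes alapy_arc is in the PATH and
# that output names follow the --name behaviour described in the usage section):
# mkdir -p /tmp/roundtrip
# alapy_arc -c reads.fastq.gz -o /tmp/roundtrip -n reads_rt
# alapy_arc -d /tmp/roundtrip/reads_rt.ac
# same_content reads.fastq.gz /tmp/roundtrip/reads_rt.fastq.gz && echo "lossless"
```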
Please tell us what you think and how we can make it better.