Tool: Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio)
Petr Ponomarenko (2.4k), United States / Los Angeles / ALAPY.com, wrote 4 months ago:

ALAPY Compressor (update: version 1.3 adds MacOS X support and improves speed and compression ratio by 10-20%)

High throughput lossless genetics data compression tool.

Compress a .fastq or .fastq.gz FILE to the .ac format, or decompress an .ac FILE to a fastq.gz file whose contents are an exact copy of the original fastq file. By default, the FILE is compressed or decompressed in place and the original file is left intact. The lossless ALAPY Compressor algorithm is used by default. You can specify an output directory (this directory must exist, as the program will not create it) and change the output file name. By default, the program prints progress information to stdout, but this can be suppressed.
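For scripted use, the behaviour described above could be wrapped roughly as follows. This is only a sketch in Python: the helper name build_alapy_command is hypothetical, it uses only the long options listed under OPTIONS in this post, and it builds the argument list without running anything.

```python
import os


def build_alapy_command(input_path, mode="compress", outdir=None, name=None,
                        quiet=False, executable="alapy_arc"):
    """Return the argument list for one alapy_arc run (nothing is executed).

    Hypothetical helper: only the options documented in this post are used.
    """
    if mode not in ("compress", "decompress"):
        raise ValueError("mode must be 'compress' or 'decompress'")
    cmd = [executable, "--" + mode, input_path]
    if outdir is not None:
        # The post states that the tool does not create the output directory,
        # so fail early instead of letting the run fail later.
        if not os.path.isdir(outdir):
            raise FileNotFoundError("output directory does not exist: " + outdir)
        cmd += ["--outdir", outdir]
    if name is not None:
        cmd += ["--name", name]
    if quiet:
        cmd.append("--quiet")
    return cmd


print(build_alapy_command("your_file.fastq.gz", quiet=True))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the executable is on the PATH.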

HOW TO GET THE TOOL

To get the latest version, please visit http://alapy.com/services/alapy-compressor/ , scroll down to the Download section, select your system by clicking on “for Windows” or “for Unix” (MacOS X is supported since version 1.3.0), read the EULA at http://alapy.com/alapy-compressor-eula/ and click on the checkbox. The DOWNLOAD button will then appear; click on it to download the tool.

All versions of ALAPY Compressor are also available on GitHub at https://github.com/ALAPY/alapy_arc . The EULA is the same, and the tool is free.

There are paid versions of the compressor with extended functionality and services. Please feel free to ask about them.

VERSIONS

Version 1.3.0

  • Added MacOS X support (10.12 Sierra and above)
  • Optimized compression on "medium" and "fast" levels (now .ac ~10-15% smaller than with 1.2.0 version)
  • Added "experimental" option for compression dictionary optimization (-a/--optimize_alphabet), which improves compression speed (up to 20%)
  • Optimized error handling

Version 1.2:

  • Added compression level option (-l / --level):
    • best - best compression ratio; the .ac file is 1.5-4 times smaller than with gzip, but compression is 4-6 times slower than gzip; requires 1500MB of memory
    • medium - medium level of compression (.ac file 3-5% bigger than on the best level), 1.2-1.8 times slower than gzip; requires 100MB of memory; default
    • fast - fastest compression, 0.5-1.4 of gzip speed; the .ac file is 4-10% bigger than on the best level; requires 100MB of memory
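As an illustration of how these memory figures might drive the choice of level in a script, here is a small Python sketch. The helper pick_level is hypothetical; only the level names and memory requirements come from the list above, and how the value is passed to -l / --level should be checked against the tool's own help.

```python
# Memory requirements in MB, as listed for version 1.2 above.
LEVEL_MEMORY_MB = {"best": 1500, "medium": 100, "fast": 100}


def pick_level(available_mb, prefer_speed=False):
    """Pick a compression level name that fits into available_mb of RAM."""
    if available_mb < min(LEVEL_MEMORY_MB.values()):
        raise ValueError("not enough memory for any compression level")
    if prefer_speed:
        return "fast"
    if available_mb >= LEVEL_MEMORY_MB["best"]:
        return "best"
    # "medium" is the tool's default and needs the same memory as "fast".
    return "medium"


print(pick_level(4000), pick_level(512), pick_level(512, prefer_speed=True))
```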

Version 1.1:

  • Added ability to output results of decompression to stdout (see help for the -d / --decompress option)
  • Added ability to compress data from stdin (see help for the -c / --compress option)
  • Changed input validation module for stdin/stdout support
  • Improved synchronization of threads (compression speed increased on average by 15%)
  • Changed data decompression module (decompression speed on average increased by 30%)
  • Optimized intermediate data decompression to write directly to output file or stdout
  • Fixed end of line characters handling
  • Fixed comparison of reads’ headers and comments

Version 0.0:

  • Initial public beta version

INSTALLATION

This tool is already compiled for Unix and Windows. Make it executable and put it in the PATH, or run it as is from its directory.

USAGE

alapy_arc [OPTION] [FILE] [OPTION]...

OPTIONS

The options for the program are as follows:

-h --help      
Print this help and exit

-v --version    
Print the version of the program and exit

-c --compress       
Compress your fastq or fastq.gz file to ALAPY Compression format .ac file.      

-d --decompress   
Decompress your .ac file to fastq or fastq.gz file.

-o --outdir    
Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each file is created in the same directory as the file which was processed. If the output file already exists in place the name of the output file will be changed by adding the output file version.

-n --name              
Change the name of the output file.

-q --quiet       
Suppress all progress messages on stdout and only report errors.

EXAMPLES

alapy_arc --compress your_file.fastq.gz --outdir ~/alapy-archive --name renamed_compressed_file --quiet

This will compress your_file.fastq.gz to renamed_compressed_file.ac in the alapy-archive directory in your home folder, provided that the alapy-archive directory exists. If renamed_compressed_file.ac is already present there, the output file is written to the alapy-archive directory with a file version added to its name.

alapy_arc -d your_file.ac

This will decompress your_file.ac (ALAPY Compressor format) into your_file.fastq.gz in the same folder. If a file named your_file.fastq.gz already exists, a file version will be added to the output file name.

alapy_arc_1.1 -d your_file.fastq.ac - | fastqc /dev/stdin
bwa mem reference.fasta <(alapy_arc_1.1 -d your_file_R1.fastq.ac - ) <(alapy_arc_1.1 -d your_file_R2.fastq.ac - ) > your_bwa_mem.SAM

These are examples of piping in general. Note that these are not POSIX: process substitution <(...) is implemented in bash, not in sh. Some programs support reading from stdin natively; read their help and/or manuals. For example, FastQC supports it this way:

alapy_arc_1.1 -d your_file.fastq.ac - | fastqc stdin

You may find more about ALAPY Compressor usage on our website http://alapy.com/faq/ (select ALAPY Compressor as the relevant topic).

PIPE-ability stdin/stdout testing

Now, with stdin/stdout support, you can use fastq.ac in your pipes, so there is no need to generate fastq or fastq.gz files on your hard drive. You can start with fastq.ac, run FastQC, then Trimmomatic, Trimgalore or CutAdapt, double check with FastQC, and use BWA or Bowtie2 in a pipe. This is what we have tested rigorously. Some tools accept stdin or - as a parameter; named pipes, process substitution and /dev/stdin are the other ways to use fastq.ac in your pipes. Here is the testing summary, where a + sign shows support as tested, while a - sign shows no high-quality support:

tool    subcommand  command line    stdin   /dev/stdin  - (as stdin)    <(…) process substitution   comment
fastqc 0.11.5   .   "alapy_arc_1.1 -d test.fastq.ac - | fastqc stdin"   +   +   -   +   recommended by authors
fastqc 0.11.5   .   "alapy_arc_1.1 -d test.fastq.ac - | fastqc /dev/stdin"  +   +   -   +   .
bwa 0.7.12-5    mem "alapy_arc_1.1 -d test.fastq.ac - | bwa mem hg19.fasta /dev/stdin > aln_se.sam" -   +   +   +   .
bwa 0.7.12-5    mem "alapy_arc_1.1 -d test.fastq.ac - | bwa mem hg19.fasta - > aln_se_1.sam"    -   +   +   +   .
bwa 0.7.12-5    mem (PE reads)  "bwa mem hg19.fasta <(alapy_arc_1.1 -d test_R1.fastq.ac -) <(alapy_arc_1.1 -d test_R2.fastq.ac -) > aln-pe2.sam"    -   -   -   +   paired end
bwa 0.7.12-5    aln "alapy_arc_1.1 -d test.fastq.ac - | bwa aln hg19.fasta /dev/stdin > aln_sa.sai" -   +   +   +   .
bwa 0.7.12-5    samse   "alapy_arc_1.1 -d test.fastq.ac - | bwa samse hg19.fasta aln_sa.sai /dev/stdin > aln-se.sam"    -   +   +   +   .
bwa 0.7.12-5    bwasw   "alapy_arc_1.1 -d SRR1769225.fastq.ac - | bwa bwasw hg19.fasta /dev/stdin > bwasw-se.sam"   -   +   +   +   long reads testing
bowtie 1.1.2    .   "alapy_arc_1.1 -d test.fastq.ac - | bowtie hs_grch37 /dev/stdin"    -   +   +   +   .
bowtie2 2.2.6-2 .   "alapy_arc_1.1 -d SRR1769225.fastq.ac - | bowtie2 -x hs_grch37 -U /dev/stdin -S output.sam"   -   +   +   +   .
bowtie2 2.2.6-2 (PE reads)  "bowtie2 -x ./hs_grch37 -1 <(alapy_arc_1.1 -d ERR1585276_1.fastq.ac -) -2 <(alapy_arc_1.1 -d ERR1585276_2.fastq.ac -) -S out_pe.sam"    -   -   -   +   paired end
trimmomatic 0.35+dfsg-1 .   "alapy_arc_1.1 -d test.fastq.ac - | java -jar trimmomatic.jar SE -phred33 /dev/stdin trimmomatic_out.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"   -   +   -   +   .
cutadapt 1.9.1  .   "alapy_arc_1.1 -d test.fastq.ac - | cutadapt -a AACCGGTT -o cutadapt_output.fastq -"    -   -   +   +   .
trimgalore 0.4.4    .   "alapy_arc_1.1 -d test.fastq.ac - | trim_galore -a AACCGGTT -o trimgalore_output /dev/stdin"    -   +   +   +   .
bbmap 37.23    .   "alapy_arc_1.1 -d test.fastq.ac - | ./bbmap.sh ref=hg19.fasta in=stdin out=mapped.sam usemodulo=t"  +   -   -   +   .
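When a workflow is driven from Python rather than from bash, the same "decompress to stdout, read from stdin" pattern from the table can be sketched with the standard subprocess module. The helper pipe_commands below is hypothetical, and the demo uses two tiny Python one-liners as stand-ins so that it runs without alapy_arc or FastQC installed.

```python
import subprocess
import sys


def pipe_commands(producer_cmd, consumer_cmd):
    """Run `producer_cmd | consumer_cmd` and return the consumer's stdout as bytes."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer.returncode != 0:
        raise RuntimeError("pipeline failed: producer=%s consumer=%s"
                           % (producer_rc, consumer.returncode))
    return output


# Stand-in commands (assumption: used only so the sketch runs anywhere).
# With the real tools the lists would look like
# ["alapy_arc_1.1", "-d", "test.fastq.ac", "-"] and ["fastqc", "stdin"].
producer = [sys.executable, "-c", "print('@read1')"]
consumer = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
result = pipe_commands(producer, consumer)
print(result)
```

Nothing is written to disk between the two processes, which is the point of the stdin/stdout support described above.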

BENCHMARK

We tested ALAPY Compressor on a diverse set of 230 public fastq files from NCBI SRA. You can read more about it on our website http://alapy.com/services/alapy-compressor/

COMPRESSOR TESTING ON THE BENCHMARK

We observed a 1.5 to 3 times better compression ratio compared to the gzipped fastq file (fastq.gz) for the current version of the algorithm. The figure below shows compression results for several representative NGS experiments, including WES, WGS, RNA-seq, ChIP-seq and BS-seq, using different HiSeqs, NextSeqs, Ion Torrent Proton, AB SOLiD 4 System and Helicos Heliscope, on human and mouse (both in the figure) as well as Arabidopsis, Zebrafish, Medicago, Yeasts and many other model organisms.

[Figure: compression results for representative human and mouse NGS experiments]

USAGE CASES

Our tool has been used on more than 2000 different fastq files, and the md5 sum of the fastq file before compression and after decompression was exactly the same in all cases. We saved several TBs of space for more science. Hurray!
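A round-trip check of this kind can be reproduced with a few lines of Python. This is a sketch, not the script used for the numbers above: the function names are hypothetical, and it assumes that .gz files should be hashed by their uncompressed contents so that a fastq and a fastq.gz with the same reads compare equal.

```python
import gzip
import hashlib


def fastq_md5(path, chunk_size=1 << 20):
    """MD5 of the FASTQ contents; .gz files are hashed after decompression."""
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_fastq(original, restored):
    """True if both files hold byte-identical FASTQ contents."""
    return fastq_md5(original) == fastq_md5(restored)
```

For example, same_fastq("your_file.fastq", "your_file.fastq.gz") would compare the original file with the result of compressing it to .ac and decompressing it again.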

FUTURE WORK

We are working on improving our algorithm, on support for other file formats and on a version that slightly changes the data in exchange for a dramatic increase in compression ratio.

Please tell us what you think and how we can make it better.

Thank you,

Petr

Modified 27 days ago • written 4 months ago by Petr Ponomarenko (2.4k)

Can you include uQ and Clumpify in comparison along with other tools mentioned in uQ thread?

Written 4 months ago by genomax (32k)

Sure, genomax2. Thank you for your interest. We are planning to write a nice simple paper using our benchmark (which we will also improve a lot). We will try to provide tools for downloading and testing the benchmark to the community as well. We have already tested tons of tools, but on a much smaller "test" benchmark. I hope we will get a few more tools to test in this thread, or ideas of what and how we should test.

If you or anybody here on Biostars is interested in such a benchmark, in compression tools or in studying this topic together with us, this is great! We can also write a paper together =)

Thank you (love Biostars and this community)

Written 4 months ago by Petr Ponomarenko (2.4k)

Hi Petr,

Can I request that you add compression support for stdout? It looks like currently decompression only supports stdout, and compression only stdin, but it would be really convenient to have a "--stdout" flag like gzip/pigz to force all output to be written to stdout. Also, while it is possible to suppress the output messages with -q, for a tool supporting primary output to stdout it might make sense to have verbose messaging directed to stderr instead so that they don't mix.

Written 3 months ago by Brian Bushnell (13k)

Sure, thank you for your request, Brian. I will write here when functionality you ask about is ready for testing and for public use. Also, it appears that there is no direct messaging on Biostars. If you want, you can send me an email to my personal mailbox pon dot petr at gmail dot com , so I can notify you personally when we have stdout for compression files and stdin for decompression with verbose messaging redirected to stderr for -q option.

Written 3 months ago by Petr Ponomarenko (2.4k)

Can you please also add xz to the comparison. xz -9 or maybe only xz -3, as it takes less time.

Modified 4 months ago • written 4 months ago by Deepak Tanwar (3.7k)

Thank you, Deepak Tanwar. We will include xz in the comparison as well. So far we just wanted to begin the conversation about NGS data compression with the Biostars community. If you or anybody here knows other tools you wish we had tested on our benchmark, please tell us. Thank you. Petr

Written 4 months ago by Petr Ponomarenko (2.4k)

Dear Deepak Tanwar

We ran a comparison of compression ratio, compression time and decompression time for xz. I know you are waiting for the -3 and -9 results. While a big reply with a benchmark explanation and software testing is in the works, we decided to start publishing some preliminary data. The results of xz testing are very interesting for us, so here is a sneak peek at the same samples we used for the figure.

Results are as follows for xz combined with gzip, both with the -6 parameter:

sra id  strategy/platform/Layout/coverage   ac_size xz_size xz/ac_size  ac_compression  xz_compression  xz/ac_compression   ac_decompression    xz_decompression    xz/ac_decompression
SRR4420255  ChIP-Seq/ NextSeq 500/PE/0.86x  1033493504  1395233200  1.35    0:15:17 1:55:40 7.57    0:22:51 0:08:26 0.37
SRR3733117  WES/Ion Torrent Proton/PE/0.01x 126371328   196094396   1.55    0:01:26 0:10:06 7.05    0:01:43 0:00:41 0.40
ERR1837133  WGS/ HiSeq 2500/PE/413x 165825536   225350136   1.36    0:03:32 0:16:07 4.56    0:04:10 0:01:16 0.30
SRR3169850  WGS/Helicos HeliScope/SE/0.07x  55185920    106048864   1.92    0:01:18 0:07:10 5.51    0:01:34 0:00:36 0.38
SRR1609039  Bisulfite-seq/ HiSeq 2500/SE/0.18x  227822592   348529068   1.53    0:04:08 0:26:04 6.31    0:04:14 0:01:21 0.32
ERR405330   ChIP-Seq/ HiSeq 2000/SE/0.04x   110747136   171723420   1.55    0:01:58 0:13:11 6.70    0:02:37 0:00:59 0.38
SRR1769225  Bisulfite-seq/ HiSeq 2000/PE/0x 39052800    50910456    1.30    0:00:26 0:03:09 7.27    0:00:34 0:00:14 0.41
SRR3034689  RNA-seq/ HiSeq 2500/SE/0.02x    32027136    41726232    1.30    0:00:26 0:02:33 5.88    0:00:31 0:00:10 0.32
SRR3136652  Bisulfite-seq/ HiSeq 2500/SE/0.04x  42756096    65482996    1.53    0:00:37 0:04:19 7.00    0:00:46 0:00:18 0.39
SRR3939105  RNA-seq/Helicos HeliScope/SE/0.1x   45579776    89515224    1.96    0:01:46 0:07:59 4.52    0:01:43 0:00:28 0.27

In short, xz is slower than ALAPY Compressor on compression and creates bigger files, but its decompression is much faster. Because of this discussion, we have started to wonder about the proper tradeoffs between decompression speed, memory usage, compression ratio and compression time.
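For readers who want to recompute the ratio columns of the table, here is a short Python sketch. The helper names are hypothetical; the input values are copied from the SRR4420255 row, and the three results agree with the ratio columns of that row.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string from the table into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def xz_over_ac(xz_value, ac_value):
    """Ratio of the xz figure to the ALAPY (.ac) figure, rounded as in the table."""
    return round(xz_value / ac_value, 2)


# Row SRR4420255: file sizes, compression times, decompression times.
size_ratio = xz_over_ac(1395233200, 1033493504)
comp_ratio = xz_over_ac(to_seconds("1:55:40"), to_seconds("0:15:17"))
decomp_ratio = xz_over_ac(to_seconds("0:08:26"), to_seconds("0:22:51"))
print(size_ratio, comp_ratio, decomp_ratio)  # 1.35 7.57 0.37
```

A ratio above 1 means xz is bigger or slower than ALAPY Compressor; a ratio below 1 (as for decompression) means xz is faster.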

So we wonder, how many of you use gzip with other than default -6 parameter?

Written 4 months ago by Petr Ponomarenko (2.4k)

I actually use gzip with -2 for temp files and pigz with -4 for temp files or -8 or -9 for permanent files.

Incidentally, I ran a test on 1 million pairs of 2x151bp E.coli reads with various compression types:

740380294 May 12 13:24 clumped.fq
 54172672 May 12 13:26 clumped.fq.ac
 75169543 May 12 13:21 clumped.fq.bz2
 89568191 May 12 13:22 clumped.fq.gz
 63142810 May 12 13:28 clumped.fqz
740380294 May 12 13:16 raw.fq
 53873152 May 12 13:19 raw.fq.ac
135320840 May 12 13:20 raw.fq.bz2
164749191 May 12 13:20 raw.fq.gz
 60250648 May 12 13:23 raw.fqz

Alapy is the clear winner, compression-wise. I do worry a little about the memory consumption, though. It seems like it was not streaming the output to a file while running. Does it go to a temp file somewhere, or does everything stay in memory until it's done?

Written 3 months ago by Brian Bushnell (13k)

We are working on the memory concern and improved memory usage in the new version 1.1: https://github.com/ALAPY/alapy_arc/tree/master/ALAPY%20Compressor%20v1.1 . This one writes directly to stdout. Is the memory usage of that version still way too high?

Written 3 months ago by Petr Ponomarenko (2.4k)

I don't really care about 1GB of overhead all that much, it's more a question of whether it uses a fixed amount of memory, or is input-dependent and might use, say, 100GB ram for a 100GB fastq file.

Modified 3 months ago • written 3 months ago by Brian Bushnell (13k)

Memory usage is not input-dependent, so it will stay around 1-1.5GB even for bigger fastq files, like 100GB ones.

Written 3 months ago by Petr Ponomarenko (2.4k)

Currently, we do write temporary files to the hard drive when we compress and are thinking about the ways to avoid this.

Written 3 months ago by Petr Ponomarenko (2.4k)

And is it fast in compression and decompression? Does it allow random access?
[My knowledge of compression is fairly limited.]

Written 4 months ago by WouterDeCoster (20k)

These are very good questions. We gave out our ALAPY Compressor to several research labs and commercial labs. They reported low cpu and memory usage and fast compression. We will test time, cpu, memory and storage usage of different tools on a big and diverse benchmark. The current version is for archiving, so no random access yet. But in general our algorithm allows it and this is one of the many things we are developing right now.

Written 4 months ago by Petr Ponomarenko (2.4k)

It looks like the latest version available on your website is still v1.1.4... it does not recognize the "-l" option.

Written 27 days ago by Brian Bushnell (13k)

Thank you Brian. We updated alapy.com website. Should work now.

Written 20 days ago by Petr Ponomarenko (2.4k)

Yes, indeed it works fine now, thanks!

Written 8 days ago by Brian Bushnell (13k)
lh3 (30k), United States, wrote 4 months ago:

So, I decided to do an evaluation myself. The input file is from GAGE-B. I am looking at "Mycobacterium abscessus 6G-0125-R". I created an interleaved fastq from the two ends. The fastq covers the reference genome at approximately a few hundred fold. Here are some numbers. The uq command line was suggested by uq.py --test on the first 4 million lines of the input. The two numbers in the timing columns give total CPU time and wall-clock time.

Type         file size       comp time    decomp time    comp RAM    comp command options
Raw      1,395,283,120
gzip       480,164,689      155.7/156.5     9.8/9.8          688k    N/A
fqzcomp    307,933,961       35.0/20.2     40.9/24.5        61.4M    N/A
alapy      241,187,840      842.3/388.3   351.9/180.5        1.5G    -c
uq+lzma    698,900,480      628.2/633.3                      659M    --compressor lzma --raw DNA QUAL QNAME --pattern 0.1 0.1

I am very impressed by alapy in that it beats fqzcomp by a large margin on this particular data set. However, the huge peak RAM makes me wonder if it is loading the entire data into RAM. If so, this would not work for deep human WGS data. I might be doing something silly with uq, as it is actually worse than gzip. Note that I am not using --sort because that is cheating. Most compressors may gain significant improvements if data are allowed to be shuffled.

Then fqzcomp, the winner of the contest. It produces a larger output file, but it is ~20X faster than alapy and uq on compression. It does not need temporary disk space or large working space in memory like alapy and uq, either. fqzcomp simply reads from a raw fastq stream and writes to a compressed data stream (i.e. it is pipe friendly like gzip). While it is not the best in terms of compression ratio -- fqzcomp was not the best in the original contest, either -- its fast speed, light memory footprint and stream-ability make it the most practical tool in this evaluation.
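The timing columns can be read in two ways: CPU time divided by wall-clock time gives the average number of cores a tool kept busy, and the wall-clock ratio against fqzcomp gives the slowdown a user would actually wait for. A small Python sketch with the compression timings copied from the table (the helper names are hypothetical):

```python
# (CPU seconds, wall-clock seconds) for compression, copied from the table above.
COMPRESSION_TIMES = {
    "gzip": (155.7, 156.5),
    "fqzcomp": (35.0, 20.2),
    "alapy": (842.3, 388.3),
    "uq+lzma": (628.2, 633.3),
}


def cores_used(cpu_seconds, wall_seconds):
    """Average number of CPU cores kept busy (CPU time / wall-clock time)."""
    return cpu_seconds / wall_seconds


def wall_slowdown(tool, baseline="fqzcomp"):
    """How many times longer `tool` takes than `baseline` in wall-clock time."""
    return COMPRESSION_TIMES[tool][1] / COMPRESSION_TIMES[baseline][1]


for tool, (cpu, wall) in COMPRESSION_TIMES.items():
    print(f"{tool:8s} cores={cores_used(cpu, wall):.2f} "
          f"slowdown={wall_slowdown(tool):.1f}x")
```

With these numbers alapy comes out at roughly 19 times the wall-clock time of fqzcomp while using a bit more than two cores on average, which is consistent with the "~20X" figure quoted above.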

Disclaimer: the developer of fqzcomp, James Bonfield, was a colleague of mine and is still a collaborator on samtools/htslib/ga4gh. My view could be biased.

Modified 4 months ago • written 4 months ago by lh3 (30k)

Thank you, lh3. It is always very important to see independent software testing.

Are these results for the HiSeq sample (https://ccb.jhu.edu/gage_b/datasets/M_abscessus_HiSeq.tar.gz) or the MiSeq sample (https://ccb.jhu.edu/gage_b/datasets/M_abscessus_MiSeq.tar.gz)? Did you test on a Windows or a Linux system? Most likely we will include some of these files in the benchmark.

We are very interested in understanding what is more important: compressed file size, compression time, decompression time, peak memory used, peak disk space used or something else? Maybe there should be an option in the compression tool to select between these priorities?

Written 4 months ago by Petr Ponomarenko (2.4k)

It is HiSeq. It is also important to include human WGS data sets.

On compression, the top requirement is stream-ability, i.e. you start to get output without reading the whole input file or a significant portion of it. This way, you can pipe the result to downstream tools without explicitly generating uncompressed files on disk. Decompression speed comes second, as data is often compressed once and used multiple times. Then file size. Then compression speed. With streaming, the tool won't take much memory/disk space anyway. Put another way, the ideal tool should just behave like gzip. It should compress more but with comparable speed. By "comparable", I mean not more than several times slower; an order of magnitude is usually unacceptable.
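To make "stream-ability" concrete, here is a minimal Python sketch of a gzip-style streaming compressor built on the standard zlib module. It is an illustration of the property only, not how any of the tools in this thread work: input is read in fixed-size chunks and compressed output is yielded as soon as zlib produces it, so memory use does not grow with input size. The function name stream_gzip is hypothetical.

```python
import gzip
import io
import zlib


def stream_gzip(source, chunk_size=64 * 1024):
    """Yield gzip-compressed pieces while reading `source` chunk by chunk."""
    # wbits=31 asks zlib for a gzip container instead of a raw zlib stream.
    compressor = zlib.compressobj(6, zlib.DEFLATED, 31)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        piece = compressor.compress(chunk)
        if piece:  # zlib may buffer small inputs and return nothing yet
            yield piece
    yield compressor.flush()


fastq = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 1000
compressed = b"".join(stream_gzip(io.BytesIO(fastq), chunk_size=4096))
print(len(fastq), len(compressed))
```

In a real pipe, source would be sys.stdin.buffer and each yielded piece would be written to sys.stdout.buffer immediately.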

Written 4 months ago by lh3 (30k)

What you are describing sounds to me like a compressor that can be part of a pipeline or a workflow. Do you think there is any use for a compressor for archiving? I personally have lots of files 1-3 years old that were analyzed a long time ago, and I store them only for archiving. I use the compressor for this purpose. Also, I tried the cram format for bam file compression and archiving, but it is scary that the original reference I used can be lost and I will lose all my data because of this. This scares me even for public reference genome files, so I do not use cram at all. I tried to remove all fastq files when there are bam files containing all reads (so the fastq can be restored), but I see a problem with the need to keep the original order, as order plays a huge role in alignments. So I rarely remove original fastq files, and as a result there is a lot of data to store in the archive: original fastq, sorted bam, indexes, separate vcf files for snps and indels, annotations of vcf files and metadata. Is it only my pattern, or do you also store so much data?

Written 4 months ago by Petr Ponomarenko (2.4k)

I believe most people would easily choose a format ready for both archive and analysis purposes over an archive only format, especially when there is a ~20X performance gap. So far as I know, most large sequencing centers discard raw fastq. Converting a sorted bam/cram to unsorted fastq with samtools is probably faster than alapy.

Written 4 months ago by lh3 (30k)

This is interesting. How would you store the order of reads? It affects alignment in paired-end experiments, making results irreproducible if the original fastq files were discarded. How can I use this in clinical settings? I understand that research projects may accept something irreproducible if there is a known reason and a clear understanding of how much it affects the end result.

Written 4 months ago by Petr Ponomarenko (2.4k)

Sorry lh3, it wasn't your fault - 5 weeks ago the README for uQ wasn't very good, so i'm not surprised it didn't work out.

When you run --test and don't specify a --compressor, uQ optimizes for smallest uncompressed file size (for in-memory applications). The uQ file in your tests is uncompressed. The right command would be:

python ./uq.py --input combined.fastq --test --compressor 'lzma -9' --sort none

I didn't use "lzma -9" for my test as that would take too long. Instead i used "pxz -3", as it's much faster and the settings it decides is best is probably very similar to that of lzma. So basically it brute-force checks the best compression for pxz, but then i actually compress with lzma -9. This is why uQ doesn't compress your output file for you, although i admit that was a non-obvious design decision on my part.

After downloading your example file and combining them via cat, i got a raw file slightly larger than yours. Maybe I should have used the "trimmed" reads? Regardless, it doesn't change much - uQ+lzma compressed down to between alapy and fqzcomp. I didn't run all the decompression speed tests or ram tests.

Type         file size       comp time    decomp time    comp RAM    comp command options
lh3 Raw    1,395,283,120
John Raw   1,396,487,030
fqzcomp    328,433,929      54s / 33s         ?                ?        -s9+ -e -q3 -n2
alapy      250,929,152      7m32s / 3m1s      ?                ?
uq+lzma    283,090,524      [ standard lzma -9, taking 10m21s to compress, and 2min to decompress ]
DSRC       336,584,052      1m5s / 14s        ?                ?        -m2

My thoughts on this issue/tool (i've only just read this post!) are:

  1. While I very much admire the work done by James on fqzcomp, and i congratulate him on winning the prize, by his own admission, it shouldn't be used in production. It frequently can't regenerate the original file, and only warns of errors on decompression, as was the case in this test. Now i'm sure with a week of work and testing that could be fixed, but regardless, fqzcomp has a serious rival, DSRC, which i believe is both faster (or roughly the same), compresses smaller than uQ+lzma in most instances, and like uQ also handles DNA of any length/kind/etc, so it's robust. By comparison, fqzcomp has all kinds of assumptions about the data hard-coded in. This is one reason why fqzcomp is so fast - it doesn't read through the file a few times to find a safe encoding like uQ does. DSRC does one better than uQ and performs block compression, using different huffman codes in different compression blocks/streams. I guess it uses a 2bit encoding like fqzcomp/uQ when there's only 4 kinds of DNA letter, but upon seeing a new kind (N, etc) starts a new block with a 3bit encoding - for a short period of time - then back to 2bit (where uQ would always use 3bit). It also means it can be sort-of-randomly accessed like a BAM file.

  2. ALAPY seems to compress files to be smaller than DSRC, which is very impressive. It is marginally slower, but no big deal. Certainly a big improvement on LZMA for this kind of data. I do have concerns with it being a closed-source project however, which for a compression format is really a deal breaker I think. Also their EULA is a bit restrictive. Having said that, i can't disagree, it appears to reliably compress FASTQ better than any other tool out there.

  3. At the time i put uQ out there, i thought it was generating the smallest FASTQ files (because i'd checked against all the competition entrants), but a good compression ratio isn't the point of uQ. The point is, if FASTQ was uQ to begin with, then none of this effort spent finding/developing compression programs would have been needed. You get 99% of the way there with just standard gzip/LZMA. As uQ is at its heart just a standard Numpy byte-array, it also means that if you make a compression program for uQ you solve the general problem of compressing data in byte arrays - something i believe hasn't been done yet, and the ALAPY authors should look into as there's probably a lot more money to be made there than in FASTQ ;) Regardless, I also don't think uQ should be used in production - or rather, only use it if you're not planning on using FASTQ :P. For this reason I also don't think sorting is cheating, because in an ideal world, the concept of a sorted FASTQ file shouldn't exist. All reads are sequenced in parallel after all. Still, the above values are without sorting. Even with sorting, uQ didn't get the file as small as ALAPY. So ALAPY is the clear winner at the moment :)
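The 2-bit encoding mentioned in point 1 can be illustrated with a toy packer in Python. This is only a sketch of the general idea (four bases per byte when the alphabet is exactly A, C, G, T); it is not the actual encoding used by fqzcomp, DSRC, uQ or ALAPY, and the function names are hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(sequence):
    """Pack an ACGT-only string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            # Raises KeyError for N or any other symbol: a fifth letter
            # no longer fits into 2 bits, which is why a 3-bit fallback exists.
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, length):
    """Inverse of pack_2bit; `length` is the number of bases originally packed."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


print(pack_2bit("ACGTACGTA").hex(), unpack_2bit(pack_2bit("ACGTACGTA"), 9))
```

The original length has to be stored alongside the packed bytes, because the padding in the last byte is indistinguishable from a run of A's.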

Modified 3 months ago • written 3 months ago by John (11k)

Wow! We did not expect that much attention to our first attempt to make a stable NGS data compression tool. It was made to see if we can develop a good NGS data compression tool in the future. We have many more ideas on how to make it faster, with a better compression ratio, smaller memory and hard drive usage, random data access, better usability, support for other data formats, etc.

As a company, we need to find a business model for NGS data compression if we are to spend time and money on improving it. That is why right now it is a closed-source project with a proprietary data format and algorithm. The end goal is to make this tool very reliable and useful, free for the scientific community and with a paid commercial version for companies. So we are very interested in addressing your concerns about the closed .ac format and our EULA. The first idea is to show a very detailed explanation of the compressed data format and our algorithm to users who have signed an NDA with us. What do you need to feel OK using ALAPY Compressor every day and relying heavily on it?

Written 3 months ago by Petr Ponomarenko (2.4k)

Those of us who have been doing NGS for a decade faced, debated, gnashed our teeth, planned for and addressed the storage issue. Those who are just encountering the problem need to do the same.

Heng Li is prescient in his comment: C: NGS files' shrinkage software: ALAPY Compressor, only fastq files so far =)

Written 3 months ago by genomax (32k)
Petr Ponomarenko (2.4k), United States / Los Angeles / ALAPY.com, wrote 4 months ago:

So, obviously, there are many, many things to do in improving the compressor, in testing it and other tools, and so on. Could you please tell us which improvements are most important to you? So far I noticed 3 requests:

  1. Test more compression tools on our benchmark (uQ, Clumpify, xz and many more from uQ - small binary FASTQ discussion)
  2. Random access
  3. Speed of compression/decompression

What are other things you would love us to do? Maybe you want to do it together with us?

Written 4 months ago by Petr Ponomarenko (2.4k)

As I commented in some other thread, on FASTQ compression, the state of art is represented by this paper. Any new tools should be compared to it. This paper is worth reading. It is in fact reporting the results of a coding contest on sequence compression. One of the authors, James Bonfield, is the winner of this contest. He was my colleague at Sanger and is behind many formats and tools such as staden, ztr, srf (which sra learns from) and cram formats. In bioinformatics, he is absolutely the expert in data compression and one of the best C programmers.

EDIT: see also discussions in this thread about the contest. Both authors of the paper were active there.

Modified 4 months ago • written 4 months ago by lh3 (30k)

I agree. We also enjoyed that paper by James K. Bonfield and Matthew V. Mahoney.

Written 4 months ago by Petr Ponomarenko (2.4k)
Petr Ponomarenko (2.4k), United States / Los Angeles / ALAPY.com, wrote 3 months ago:

Some of our clients and testers and the Biostars community here, including lh3 and maybe genomax (by the way, why did you change it from genomax2?), were talking about VCF file analysis when you get a big number of files, like millions of samples, and want to store, analyze and compare them efficiently. We agree with you and think that such a system is going to be very much needed. This is why we started to work on it. Here we post a version of our VCF variant annotation and interpretation server solution, ALAPY Genome Explorer (AGx), a variant annotation and filtration server that can be used via any modern browser. It has limited GUI functionality, but with your ideas and suggestions we will understand how to make this functionality of analyzing many, many samples available via GUI or API. What do you think about it?

Written 3 months ago by Petr Ponomarenko (2.4k)
Petr Ponomarenko (2.4k), United States / Los Angeles / ALAPY.com, wrote 3 months ago:

Dear Biostars administrators, is it OK to write about software updates in the answer section?

Thank you all for the valuable ideas and testing of the ALAPY Compressor tool. Based on your discussion and input from our customers, we created a new release of the tool.

The latest version can be downloaded from ALAPY website here: http://alapy.com/services/alapy-compressor/#download . Also, you can find all versions on the github here: https://github.com/ALAPY/alapy_arc .

We made some changes and improvements as listed below:

* Added ability to output results of decompression to stdout (see help for the -d / --decompress option)
* Added ability to compress data from stdin (see help for the -c / --compress option)
* Changed input validation module for stdin/stdout support   
* Improved synchronization of threads (compression speed increased on average by 15%)
* Changed data decompression module (decompression speed on average increased by 30%)
* Optimized intermediate data decompression to write directly to output file or stdout
* Fixed end of line characters handling
* Fixed comparison of reads’ headers and comments

So it is now faster, uses less memory and disk space, and can work with stdin/stdout, which allows easier integration into pipes. You can now also use it in these ways:

alapy_arc_1.1 -d your_file.fastq.ac - | fastqc /dev/stdin
bwa mem reference.fasta <(alapy_arc_1.1 -d your_file_R1.fastq.ac - ) <(alapy_arc_1.1 -d your_file_R2.fastq.ac - ) > your_bwa_mem.SAM

FastQC also accepts "stdin" as a file name to read from the input stream:

alapy_arc_1.1 -d your_file.fastq.ac - | fastqc stdin

We will continue working on improving the algorithm and testing it.

So far our tool has been used on more than 1000 different fastq files, and in every case the MD5 sum of the fastq file is exactly the same before compression and after decompression.
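
A check of this kind can be sketched as follows. This is only an illustration: it uses Python's gzip module as a stand-in codec and does not call alapy_arc, and the helper names `md5_of_stream` and `roundtrip_is_lossless` are made up for the example.

```python
import gzip
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def roundtrip_is_lossless(original_bytes, compress, decompress):
    """Compress, decompress, and compare the MD5 sums before and after."""
    restored = decompress(compress(original_bytes))
    before = md5_of_stream(io.BytesIO(original_bytes))
    after = md5_of_stream(io.BytesIO(restored))
    return before == after


# One FASTQ record quoted elsewhere in this thread; gzip is only a stand-in codec.
record = (
    b"@SRR3034689.1 1 length=51\n"
    b"GTATCAACGCAGAGTACATGGGAAAGGTTTGGTCCTGGCCTTATAATTAAT\n"
    b"+SRR3034689.1 1 length=51\n"
    b"@==DAD4AFA?@1CFFG@EHF?F>CB81?CGIGBEGDDBFGAAFD@FFFIB\n"
)
print(roundtrip_is_lossless(record, gzip.compress, gzip.decompress))
```

For real files the same digest function can be applied to the original fastq and to the decompressed output stream, so neither has to be held in memory at once.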

modified 3 months ago • written 3 months ago by Petr Ponomarenko

It would be best to edit original post since the latest info would always be in the main post. You could add a section below the original post if all that text needs to stay. Editing original post automatically bumps it up to main page :)

modified 3 months ago • written 3 months ago by genomax

Thank you, genomax2. I will update my main post if it is OK that it gets bumped up as a result. I saw that several people on this forum were asked not to bump their topics to the top too frequently.

I was thinking about publishing as many ways as possible to use the stdout functionality of ALAPY Compressor with programs like FastQC, BWA and Bowtie that accept file names as parameters. I know 3 ways to do it: 1) /dev/stdin, 2) bash process substitution via <(command) and 3) named pipes. I decided not to include named pipes because a script managing the pipe names is needed, especially if many processes are going to run simultaneously. By any chance, does anybody know other methods?
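
For the named-pipe route, one way to handle the naming problem is to create each pipe inside its own temporary directory, so that simultaneous runs cannot collide. The sketch below is a minimal, Unix-only illustration; `stream_via_named_pipe`, `demo_producer` and `count_lines` are hypothetical names, and in practice the producer would be a process such as `alapy_arc -d your_file.fastq.ac -` with its stdout redirected to the pipe.

```python
import os
import shutil
import tempfile
import threading


def stream_via_named_pipe(producer, consumer):
    """Feed producer's output to consumer through a uniquely named FIFO.

    producer(fh) writes text to an open file handle;
    consumer(path) reads from a file name, like a tool that only accepts paths.
    """
    # mkdtemp gives every call a private directory, so concurrent runs
    # never share a pipe name.
    tmpdir = tempfile.mkdtemp(prefix="fastq_fifo_")
    fifo_path = os.path.join(tmpdir, "reads.fastq")
    os.mkfifo(fifo_path)

    def write_side():
        try:
            with open(fifo_path, "w") as fh:  # blocks until a reader opens the FIFO
                producer(fh)
        except BrokenPipeError:
            pass  # the reader closed early; nothing more to send

    writer = threading.Thread(target=write_side)
    writer.start()
    try:
        return consumer(fifo_path)
    finally:
        writer.join()
        shutil.rmtree(tmpdir)  # remove the FIFO together with its directory


def demo_producer(fh):
    fh.write("@read1\nACGT\n+\nIIII\n")


def count_lines(path):
    with open(path) as fh:
        return sum(1 for _ in fh)


print(stream_via_named_pipe(demo_producer, count_lines))
```

The sketch assumes the consumer always opens the pipe; if it fails before doing so, the writer would block, so a production script would need a timeout or similar guard.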

written 3 months ago by Petr Ponomarenko

Adding an update to a tool seems like a fine reason for bumping ;-)

written 3 months ago by WouterDeCoster

Since you are adding new information that supersedes pre-existing data, a bump is perfectly essential/applicable.

Having new information buried in an answer (those can move up or down based on # of votes) would actually make it confusing to find the latest info.

written 3 months ago by genomax
Petr Ponomarenko (United States / Los Angeles / ALAPY.com) wrote 4 months ago:

As genomax2 and Deepak Tanwar recommended, we have started testing other software rigorously on our benchmark, measuring the CPU time and memory used to compress and decompress, as WouterDeCoster suggested. We will publish the results here later.

So far, on the first 11 samples from our benchmark, clumpify failed on SRR4289117, while the other programs and other samples were OK. The compression time of xz is approximately 4.5-7.6 times greater than that of ALAPY Compressor, but its decompression is 2.5-3.7 times faster. uQ+gzip is 2.9-19.9 times slower than ALAPY Compressor.

In terms of compressed file size, the ALAPY-compressed file is always 1.3-2 times smaller than the xz-compressed file and 1.1-1.7 times smaller than the uQ+gzip-compressed file.

Do you think reordering reads is ok? It affects how aligners collect statistics on the sample, and they assume a random distribution, don't they? Alignment results after read reordering are slightly different.

Our algorithm allows fast random access. Could you please tell us if this is more important than a higher compression ratio, faster speed, BAM compression, parallelization and other things? What is needed so that you can start using ALAPY Compressor for all your samples, or at least for archiving?

written 4 months ago by Petr Ponomarenko

> clumpify failed on SRR4289117

I tested clumpify on SRR4289117.fastq.gz (obtained from EBI-ENA) and got it to work. Remember to add qin=33 (needed since this is ABI-SOLiD data).

Reads In:           73653740
Clumps Formed:       3307072

On a different note (being a devil's advocate): Why is compressing fastq data further/better a need? Why not allow NCBI/EBI/DDBJ to hold/host your data for free? Saving data to tape allows for efficient compression on backup tape automatically so that takes care of the local end of things.

modified 4 months ago • written 4 months ago by genomax

Thank you, so there should be a way for a wrapper script to notice AB SOLiD data type and change parameters accordingly.

We are working on making a set of (hopefully) public scripts to download benchmark and software for testing, as well as to run all of these tools and collect cpu time for compression/decompression, memory and hard drive disk space usage.

For us, fastq compression is a first step in optimizing NGS and genetics data management and analysis in general. We already have efficient in-house VCF compression for fast access, and we want the same for BCL, fastq and BAM/SAM/CRAM files a bit later, with added support for map, ped, bed, bim, fam and other common file types. Fastq is just a very logical first step on that path.

written 4 months ago by Petr Ponomarenko

Pardon my skepticism but I am only looking at this from an end user perspective ...

Unless these compression schemes are co-opted by sequencer manufacturers, they are going to present another hurdle for a normal bench biologist. As it is, things are an uphill battle (for various reasons), and introducing another step will make things more complex.

As for saving storage space on a large scale (this may only be applicable to the NCBI/ENAs of the world), there are elegant solutions that de-dupe/compress data "in-line" while it is being written to disk (or the tape backup compression that happens automatically). As with everything, there are some downsides: some of these solutions use proprietary compression algorithms (which use dedicated FPGAs etc.), adding another layer.

written 4 months ago by genomax

Dear genomax2, do you or your friends/colleagues use compression of fastq.gz, bam or vcf.gz files at least for archiving? If yes, what do you use and why? If you do not use, then what prevents you from doing so? Thank you, Petr

written 4 months ago by Petr Ponomarenko

As I have said before we rely on the compression achieved by backup software and LTO-6 tape hardware, which includes encryption. Depending on compressibility of data we can fit ~5 TB (or more) on an LTO-6 tape. At less than $30 per tape it is still the most cost effective way of archiving data.

modified 4 months ago • written 4 months ago by genomax

I agree, but at the same time, I see doctors and researchers receiving hard drives over mail with terabytes of fastq, bam and vcf files to store and analyze. If sequencing prices continue to go down and the amount of data per sample continues to grow, I can imagine a moment when archiving on tape will be more expensive than the sequencing itself. Then data storage, transfer and analysis will hinder further adoption of NGS and its progress. Anyway, right now I have a much simpler problem of having way too little data on one of my servers because several projects outgrew the same and number of samples/analysis we planned and now I either need to transfer them to another bigger server or find a few terabytes of extra space on the current one. I can not remove fastq.gz files, but I can compress some of fastq.gz files with any lossless algorithm.

Is it only my problem?

written 4 months ago by Petr Ponomarenko

> I see doctors and researchers receiving hard drives over mail with terabytes of fastq, bam and vcf files to store and analyze

This is not a new problem. With a few exceptions, most doctors would prefer to have a chart/figure/table to look at summaries of actionable items rather than piles of sequence data since they don't have time to worry about specifics. Training new physicians so they become comfortable with the new generation technology is a big hurdle, which is being tackled as a part of med school curriculum. Re-training older generation is a different problem altogether.

> Anyway, right now I have a much simpler problem of having way too little data on one of my servers because several projects outgrew the same and number of samples/analysis we planned and now I either need to transfer them to another bigger server or find a few terabytes of extra space on the current one. I can not remove fastq.gz files, but I can compress some of fastq.gz files with any lossless algorithm.

Something in that statement does not make sense. Perhaps you meant to say you have too little free space instead of data?

We will certainly get to a point where sequencing would become an easy/cheap commodity where storing any sequence will become pointless.

modified 4 months ago • written 4 months ago by genomax

Genomax2, you are right. I have very little free space on one of our servers.

written 4 months ago by Petr Ponomarenko

Let me buy some stock in storage companies before you put an order in :)

On a serious note, one should never get to this stage. I am (perhaps incorrectly) assuming that you are not an infrastructure person? You need to convince the owners of the data/your CIO that a storage failure could lead to disaster (especially if you have not done any backup, tape or otherwise, again this assumption may be incorrect). Compression of data to free up space should be the last of your worries.

modified 4 months ago • written 4 months ago by genomax

:) it is a server used for research by many people. Sure I told server owners and managers that storage is a big problem.

written 4 months ago by Petr Ponomarenko

What, Clumpify works on SOLiD data? I had no idea! I intentionally stripped out most of the SOLiD-specific code from BBTools a while ago so I'm quite surprised.

...I downloaded the fastq, and it looks like this:

@SRR4289117.1 1/1
CGGAGTTGCGTTCTCCTCAGCACAGACCCGGAGAGCACCGCGAGGGCGGA
+
=NEPLQI7N\]MAP\_G/?U^XLPZ^D4LLMB*D]Y?+AYX>)!$GD*83

Is that SOLiD data? I'm not really sure how SOLiD data gets stored by SRA, but it certainly is not colorspace...

written 3 months ago by Brian Bushnell

This experiment was run on an AB SOLiD 4 System. The reads in SRA run SRR4289117 are not in color space, which is why Clumpify worked on them. Our purpose was to show the compression ratio and processing time for different sequencing platforms. We have not looked at compression of color space fastq files in depth yet.

written 3 months ago by Petr Ponomarenko

OK, interesting. I used to work with SOLiD 4, and it only produces colorspace. The only way to correctly make base-space data from the reads is to map and postprocess them, which is why I'm surprised that they show up as base-space fastq from an archive...

But either way, it doesn't really matter. I would not worry about optimizing compression algorithms for colorspace, since it's obsolete and was never a good idea anyway. So far your approach looks very impressive!

modified 3 months ago • written 3 months ago by Brian Bushnell

Please correct me if I am wrong, but I think you might be able to convert back to colorspace if you know the first base. You can even make the first base up: the colorspace encoding will be the same other than for the first color in the sequence.

So from
CGGAGTTGCGTTCTCCTCAGCACAGACCCGGAGAGCACCGCGAGGGCGGA
you can make
T23022101331022202212311122100302222311033322003302
using this table for the transformation and if you begin with T
AA, CC, GG, TT : 0
AC, CA, GT, TG : 1
AG, CT, GA, TC : 2
AT, CG, GC, TA : 3
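
The table above can be expressed compactly in code. The sketch below is illustrative (the function name `to_colorspace` is made up here) and relies on one observation: with the two-bit codes A=0, C=1, G=2, T=3, the color of a dinucleotide in the table equals the XOR of the two base codes. It assumes reads contain only A, C, G and T.

```python
# Two-bit codes chosen so that the color of a dinucleotide is the XOR of its
# bases; this reproduces the table (e.g. AC -> 1, CG -> 3, TC -> 2, GG -> 0).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colorspace(bases, leading_base="T"):
    """Encode a base-space read as a leading base followed by one color per base."""
    colors = []
    previous = leading_base
    for base in bases:
        colors.append(str(BASE_CODE[previous] ^ BASE_CODE[base]))
        previous = base
    return leading_base + "".join(colors)


read = "CGGAGTTGCGTTCTCCTCAGCACAGACCCGGAGAGCACCGCGAGGGCGGA"
print(to_colorspace(read))
```

Calling `to_colorspace(read, "A")` instead changes only the leading base and the first color; the remaining colors depend only on adjacent bases of the read.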

Regarding the quality scores, it depends on how these were calculated during the conversion from color space to basespace. They originally might not have been changed during the conversion at all.

An interesting aspect of the benchmark we are working on is that fastq-dump in the SRA toolkit should, by default, have used colorspace encoding for SOLiD data (the options -C and -B override the experiment's default encoding and force colorspace or basespace respectively, but we did not use them during the download). Why does SRA default to basespace encoding for some fastq files produced on the SOLiD platform, and how much does this affect analysis results across the bioinformatics community?

written 3 months ago by Petr Ponomarenko

Raw SOLiD data ought to be kept in the color space because once converted to bases naively, one color error leads to consecutive base errors to the end of the read or until the next color error. Anyway, just forget SOLiD. No one cares about it nowadays.
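
The error propagation can be seen in a small sketch of naive decoding. It assumes the two-bit codes A=0, C=1, G=2, T=3, under which each color is the XOR of adjacent bases, so each decoded base is the previous base XOR the color. The helper names are hypothetical, and the colorspace string is the one derived earlier in this thread for the first read of SRR4289117.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def from_colorspace(colorspace_read):
    """Naively decode 'leading base + colors' into bases (leading base not returned)."""
    current = BASE_CODE[colorspace_read[0]]
    bases = []
    for color in colorspace_read[1:]:
        current ^= int(color)  # next base = previous base XOR color
        bases.append(CODE_BASE[current])
    return "".join(bases)


def with_color(colorspace_read, color_index, new_color):
    """Return a copy with the color at 0-based color_index replaced."""
    position = color_index + 1  # skip the leading base
    return colorspace_read[:position] + new_color + colorspace_read[position + 1:]


def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


good = "T23022101331022202212311122100302222311033322003302"
one_error = with_color(good, 10, "2")        # color 10 was "1"
two_errors = with_color(one_error, 20, "0")  # color 20 was "3"

reference = from_colorspace(good)
print(count_mismatches(reference, from_colorspace(one_error)))   # bases 10..49 differ
print(count_mismatches(reference, from_colorspace(two_errors)))  # only bases 10..19 differ
```

In this example the second error cancels the first because both change the color by the same XOR difference; a second error with a different difference would leave the rest of the read wrong in a different way.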

modified 3 months ago • written 3 months ago by lh3

> Do you think reordering reads is ok?

Wouldn't this be problematic for paired-end sequencing?

Are you only testing for Illumina sequencing, or also for long reads such as PacBio/Nanopore? I'm not sure if that would make a difference in your benchmarking.

written 4 months ago by WouterDeCoster

There is a way to reorder two files, this lowers compressed file size.

Our benchmark includes PacBio, Helicos, AB SOLiD and Ion Torrent, plus NextSeq and HiSeq from Illumina. We do not have MiSeq, Nanopore or 10X Genomics data in it yet; we are working on adding them.

We also have gene panels, WES (WXS), WGS, RNA-seq, ChIP-seq and BS-seq data in the benchmark. Our first users also tried the tool with 16S-seq, but we do not have such data in the benchmark yet. What other sequencing strategies need to be included in the testing, in your opinion?

written 4 months ago by Petr Ponomarenko

> There is a way to reorder two files, this lowers compressed file size.

You should keep the read order consistent in the two files (unless you are interleaving reads) to prevent downstream issues.
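
A quick way to verify that two paired files are still in the same order is to compare read names record by record. The sketch below is a hypothetical helper, not part of any tool mentioned here; it assumes well-formed four-line FASTQ records and mate suffixes of the form /1 and /2.

```python
import io
from itertools import zip_longest


def read_names(fastq_handle):
    """Yield the read name from every FASTQ header line (every 4th line)."""
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 0:
            name = line[1:].split()[0]       # drop '@' and any comment
            if name.endswith(("/1", "/2")):  # drop the mate suffix if present
                name = name[:-2]
            yield name


def first_order_mismatch(r1_handle, r2_handle):
    """Return the 0-based index of the first pair whose names differ, or None."""
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index
    return None


r1 = "@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n"
r2_same_order = "@a/2\nTT\n+\nII\n@b/2\nCC\n+\nII\n"
r2_reordered = "@b/2\nCC\n+\nII\n@a/2\nTT\n+\nII\n"

print(first_order_mismatch(io.StringIO(r1), io.StringIO(r2_same_order)))
print(first_order_mismatch(io.StringIO(r1), io.StringIO(r2_reordered)))
```

A file with fewer records than its mate is also reported, at the index of the first missing record.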

written 4 months ago by genomax

Seems you are already testing data from a lot of technologies. For Nanopore data you could take a look here.

Can your tools read from stdin (for compression) and stream to stdout (for decompression)?

written 4 months ago by WouterDeCoster

Thank you, WouterDeCoster. I am very grateful for your recommendation and am looking forward to testing Nanopore data.

Technically we can read from stdin and write to stdout, but this is not implemented in the published version. We are not sure about safety and usability, and we want to better understand users' needs for these two features. Providing log information also becomes a bit harder. Could you please recommend the best way to use stdin and stdout in your case?

Thank you

written 4 months ago by Petr Ponomarenko

Hi Petr,

Thanks for the quick replies. For stdin/stdout I can imagine the following scenarios:

  • Filter a fastq file (e.g. by quality, or trim) and directly stream into ALAPY for compression to avoid intermediate files
  • Stream from an ALAPY compressed fastq directly to e.g. bwa for aligning

But that's maybe not a top priority, could just be a nice feature I think.

written 4 months ago by WouterDeCoster

Good examples. What do you think of piping CASAVA directly into ALAPY Compressor and the ability to use FastQC, BWA, Bowtie and other tools with no decompression? Same for bam files. This is my vision. Our compressor is already better than sra. We will make it better, faster, lighter on RAM and so on. I wish NCBI would adopt it. It will save their money and I personally consider this a very good thing. Is this possible? What should we do in order to achieve it?

written 4 months ago by Petr Ponomarenko

> ability to use FastQC, BWA, Bowtie and other tools with no decompression?

Sounds great, but this would require that all other tools adapt to ALAPY, which I believe is rather unlikely. I'll buy you a beer if it works out! Piping avoids having such dependencies and all tools which can handle fastq on stdin can also handle your compression format.

> I wish NCBI would adopt it. It will save their money and I personally consider this a very good thing. Is this possible? What should we do in order to achieve it?

I have no idea at all about this, but I wish you the best of luck convincing them!

modified 4 months ago • written 4 months ago by WouterDeCoster

> > ability to use FastQC, BWA, Bowtie and other tools with no decompression?
>
> Sounds great, but this would require that all other tools adapt to ALAPY, which I believe is rather unlikely. I'll buy you a beer if it works out! Piping avoids having such dependencies and all tools which can handle fastq on stdin can also handle your compression format.

Dear WouterDeCoster, stdin/stdout support has been added. Let's have a beer at some conference to discuss the future of NGS. What would convince you to begin using a fastq or bam compressor now?

written 3 months ago by Petr Ponomarenko

Now that stdin/stdout are added, I can add ALAPY support in BBTools, at least :) I already did for fqzcomp, and it only took an hour or two. The only thing I need is a unique file extension to designate that the file has been ALAPY-compressed. How about ".alapy"? As in, "foo.fastq.alapy". Or do you already use something else?

modified 3 months ago • written 3 months ago by Brian Bushnell

.ac appears to be the default extension for ALAPY.

written 3 months ago by genomax

Wow. This is very interesting. Thank you, Brian! We do use .ac as an extension as genomax2 said. We maintain backward compatibility so everything that was compressed with the old version will decompress properly on the latest version and all the features will work including stdin/stdout.

written 3 months ago by Petr Ponomarenko

We tested whether different tools work with piped decompression. We have now tested yours as well, and it works. We hope this makes it easy to add .ac support to BBMap.

$ ~/work/alapy_arc_1.1 -d ~/work/test_files/test.fastq.ac - | ./bbmap.sh ref=~/work/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa in=stdin out=mapped.sam usemodulo=t

Regarding piping of compression, so that the .ac-compressed fastq file is streamed to stdout: could you please give an example of how it would be used? Why do you need it?

written 3 months ago by Petr Ponomarenko

I thought this was necessary to plug it in to the standard way BBTools handles external compression/decompression tools with minimal additional code, but upon review, it looks like I don't need that after all and it should be fine as is. However, the one thing I do need is a fixed name for the executable to use (like "alapy_arc") - if every version has a different executable name, BBTools won't be able to find it.
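
To illustrate why a fixed executable name and a unique extension matter, here is a minimal sketch of extension-based dispatch. It is not how BBTools is implemented; the `DECOMPRESSORS` table and `decompress_command` are hypothetical, and only the `alapy_arc -d <file> -` form (decompress to stdout) is taken from this thread.

```python
import os

# Assumed mapping for illustration; only the alapy_arc invocation
# ("-d <file> -" writes the decompressed fastq to stdout) comes from this thread.
DECOMPRESSORS = {
    ".ac": lambda path: ["alapy_arc", "-d", path, "-"],
    ".gz": lambda path: ["gzip", "-dc", path],
}


def decompress_command(path):
    """Return the argv that streams `path` as plain fastq to stdout, or None."""
    extension = os.path.splitext(path)[1].lower()
    builder = DECOMPRESSORS.get(extension)
    return builder(path) if builder else None


print(decompress_command("reads.fq.ac"))
print(decompress_command("reads.fq"))
```

The returned list can be passed directly to `subprocess.Popen` with `stdout=subprocess.PIPE`; if the executable were named differently in every release, the table would have to change with each version.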

written 3 months ago by Brian Bushnell

Good, thank you. Let's stick with alapy_arc then if this sounds appropriate.

written 3 months ago by Petr Ponomarenko

OK, another issue - there does not seem to be a way to name the output file. For example:

cat 8k.fq | ./alapy_arc -c -

That produces a file outside of the current working directory, named after the working directory. I've got input working fine, just not output. So for example you can do this:

bbduk.sh in=reads.fq.ac out=filtered.fq.ac minavgquality=7

...which will work, but the output file will not be in the correct place with the expected name.

modified 3 months ago • written 3 months ago by Brian Bushnell

Also, it looks like it is creating a temp directory with a random name in the working directory, and not deleting it when the process ends.

written 3 months ago by Brian Bushnell

Thank you, Brian, we addressed your issues in the new version 1.1.3 https://github.com/ALAPY/alapy_arc/tree/master/ALAPY%20Compressor%20v1.1.3

Change log:

* fixed temporary data cleaning on SIGINT (Ctrl+C), SIGTERM and SIGABRT signals
* fixed output file name for cases of stdin compression with no name specified with -n option

You can now use the -n / --name option to specify the output file name and -o / --outdir to specify the output directory. So you can now do

cat 8k.fq | ./alapy_arc -n outfile_name -o outfile_dir -c -

Could you please explain in which situation the temporary folder was not removed after the program finished? We spotted that behavior on SIGINT (Ctrl+C), SIGTERM and SIGABRT signals on Unix and fixed it. Thank you, this was very helpful, and we are eager to make this tool work well for your needs.

written 3 months ago by Petr Ponomarenko

Thanks, writing .ac output files now works as expected.

I'm not really sure what's going on with the temp directories. Alapy seems to work fine on the command line, but when I spawn a decompression process from Java, the temp directory does not get deleted. My code looks like this (where "p" is a Process):

if(verbose){System.err.println("Trying p.waitFor()");}
try {
  x=p.waitFor();
  if(verbose){System.err.println("success; return="+x);}
  break;
} catch (InterruptedException e) {
  if(verbose){System.err.println("Failed.");}
  e.printStackTrace();
}

The description of Process.waitFor():

Causes the current thread to wait, if necessary, until the process represented by this Process object has terminated.

Result:

Trying p.waitFor()

success; return=0

So it doesn't seem like I'm sending the process any kind of signal, and I'm getting a return code of 0. In other words, the process seems to be ending normally.

OK, I see part of the problem - I'm opening the file twice; the first time is just to read the first few lines to figure out things like the format and whether it is interleaved or not. I don't read the entire file. The temp file is left over from that read. The second case, where I read the whole file, works fine. I modified the code slightly:

if(verbose){System.err.println("Trying p.waitFor()");}
try {
  long t=System.nanoTime();
  if(verbose){System.err.println("p.isAlive()="+p.isAlive());}
  x=p.waitFor();
  if(verbose){System.err.println(System.nanoTime()-t+" ns");}
  if(verbose){System.err.println("success; return="+x);}
  break;
} catch (InterruptedException e) {
  if(verbose){System.err.println("Failed.");}
  e.printStackTrace();
}

Now it prints this, for the first process:

Trying p.waitFor()
p.isAlive()=true
34073996 ns
success; return=141

It looks like 141 means SIGPIPE, so I'm guessing that the process keeps writing to a pipe even though I'm not reading from it any more, which gets too full and breaks. Or something like that.
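
The arithmetic behind that reading: under the common Unix convention, a process killed by signal N is reported with exit status 128 + N, and SIGPIPE is signal 13 on Linux and macOS. A small sketch (the function name is made up for illustration):

```python
import signal


def signal_from_exit_status(status):
    """Map a shell-style exit status (128 + N) to the name of signal N, if any."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no such signal number on this platform


print(signal_from_exit_status(141))
```

Note that Python's own `subprocess` module reports a signal death as a negative return code (for example -13) rather than 128 + N, so this mapping applies to statuses reported by a shell or, as here, by Java's `Process.waitFor()`.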

modified 3 months ago • written 3 months ago by Brian Bushnell

Thank you, Brian! We fixed SIGPIPE handling and now it is version 1.1.4 https://github.com/ALAPY/alapy_arc/tree/master/ALAPY%20Compressor%20v1.1.4 .

Now it should work like this:

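```
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class AlapyFirstRecord {
    public static void main(String[] args) {
        String path_to_alapy_bin = args[0]; // path to alapy_arc
        String path_to_sample = args[1];    // path to the .ac file
        try {
            // "-d <file> -" decompresses the sample to stdout
            Process p = new ProcessBuilder(path_to_alapy_bin, "-d", path_to_sample, "-").start();
            // Read only the first FASTQ record (4 lines), then close the pipe early
            try (BufferedReader bri = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                for (int i = 0; i < 4; ++i) {
                    String line = bri.readLine();
                    if (line == null) {
                        break;
                    }
                    System.out.println(line);
                }
            }
            int return_code = p.waitFor();
            System.out.println("Done (return_code = " + return_code + ")");
        } catch (Exception err) {
            err.printStackTrace();
        }
    }
}
```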

path_to_alapy_bin - path to the alapy_arc executable

path_to_sample - path to the compressed sample (.ac file) to decompress

Expected output:

```
@SRR3034689.1 1 length=51
GTATCAACGCAGAGTACATGGGAAAGGTTTGGTCCTGGCCTTATAATTAAT
+SRR3034689.1 1 length=51
@==DAD4AFA?@1CFFG@EHF?F>CB81?CGIGBEGDDBFGAAFD@FFFIB
Done (return_code = 243)
```

Now there should be no leftover tmp files or folders.

modified 3 months ago • written 3 months ago by Petr Ponomarenko

Hi Petr,

I tried it, and it seems to work fine. I've uploaded the new version of BBMap (37.24) with Alapy compression support (as long as alapy_arc is in the path). Thanks for the fix!

modified 3 months ago • written 3 months ago by Brian Bushnell

Thanks for finding all these bugs, Brian!

written 3 months ago by Petr Ponomarenko

Dear Brian, We updated ALAPY Compressor to v.1.2. Added compression levels best, medium, fast. Medium is the default. Now it uses less memory and on the fast level it is 2 times faster than gzip on some fastq files.

written 11 weeks ago by Petr Ponomarenko

OK, good. Actually the fast level sounds quite impressive! I'll add support for those soon (mapping them to the "zl" flag which controls the compression level in gzip and bzip2).

written 11 weeks ago by Brian Bushnell

Sounds great. I'll start using it on Monday ;-)

> or bam compressor

Did I miss something, have you added bam support?

written 3 months ago by WouterDeCoster

Sounds awesome, WouterDeCoster. Thank you. We are already working on a bam compressor, but it is not ready for public testing yet. We would love to discuss which features are important for bam compression/decompression. Is SAM and CRAM compression important as well?

modified 3 months ago • written 3 months ago by Petr Ponomarenko

If you are in academia, go ahead. SAM compression is fun and once published, your idea can contribute back to the community. If you are a company, working on a SAM compressor is most likely a waste of your resources. Fastq files are streamed most of the time, so a new compressor doesn't need to be deeply integrated into 3rd-party tools. SAM files are often randomly accessed. To use a new BAM alternative, 3rd-party tools have to access data at the API level, and the chance of other tools calling proprietary APIs is near zero. This winds back to archival vs analysis formats: when there is a good enough analysis-ready format like cram and bam, no one will use an archival-only format. Save your resources for something else.

written 3 months ago by lh3

Some things I think about concerning bam compression are that you still have to be able to use variant calling tools, or bedtools, or have a look at the reads in igv. Not sure if things like that are compatible with compression, though.

I don't really see an added value of compressing sam (if you can compress bam).

written 3 months ago by WouterDeCoster

Great. Thank you for the offer, WouterDeCoster. We like beer a lot =)

written 4 months ago by Petr Ponomarenko

> I wish NCBI would adopt it. It will save their money and I personally consider this a very good thing. Is this possible? What should we do in order to achieve it?

As I said before at the scale where NCBI/EBI operate a hardware based solution (in-line compression/dedupe) would likely be more efficient than a software based solution.

Best case use scenario would be to have ALAPY (or any) compressor work transparently. @Brian does something similar with pigz (which is a threaded replacement for gzip). BBMap suite will use pigz, if it is available in $PATH, but will fall back to gzip seamlessly, if it is not there (with switches to turn off pigz if wanted/needed).
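
The fallback behaviour described here can be sketched in a few lines. This is an illustration of the idea only, not BBMap's actual code; `pick_compressor` and its `use_pigz` switch are hypothetical names, and `shutil.which` performs the $PATH lookup.

```python
import shutil


def pick_compressor(preferred=("pigz", "gzip"), use_pigz=True, which=shutil.which):
    """Return the first available executable name from `preferred`, or None."""
    for name in preferred:
        if name == "pigz" and not use_pigz:
            continue  # the switch to turn pigz off
        if which(name) is not None:
            return name
    return None


print(pick_compressor())                # first of pigz/gzip found in $PATH, or None
print(pick_compressor(use_pigz=False))  # never pigz, even if installed
```

The `which` parameter exists only so the lookup can be replaced in tests; a transparent ALAPY integration would add `alapy_arc` to a similar lookup for .ac files.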

modified 4 months ago • written 4 months ago by genomax

For a commercial product, there is a lot more to think about than being technically "better". Proprietary status, timing, compatibility and conventions are more important than technical advances. To be frank, I seriously doubt your fastq/sam compression tools could ever get a sizable market. If you are good at compression, work on VCF compression for 1M samples and the associated analysis tools.

written 3 months ago by lh3