Question

Optimizing Java Jar's parameters while processing large high throughput sequencing data

4

7.7 years ago

SOHAIL ▴ 400

Hi everybody,

Like other NGS data analysts, as a new user I am also using Java applications designed for NGS data analysis. I have seen various JVM and library tuning parameters on the java command line when processing large data sets. For example, I recently came across these two commands:

java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

Parameters: -Dsamjdk.buffer_size -XX:GCTimeLimit -XX:GCHeapFreeLimit -Xmx128m -XX:+UseStringCache -Dsamjdk.use_async_io=true

I tried to learn about them from the internet, but it is still not clear to me how to set these parameters (the ones listed above) for large data sets when piping several tools together. As I have a biological background, could anyone please explain in a bit of detail how to use them, with a one-line definition and purpose for each?

Thank you very much!

ngs java • 3.2k views

updated 5.8 years ago by Biostar 20 • written 7.7 years ago by SOHAIL ▴ 400

0

Hi Pierre,

Thank you for explaining each aspect so comprehensively. :) However, I have one point of confusion: is the memory for these buffers in Java taken from the heap's allocated memory, or is it assigned independently within the JVM?

7.7 years ago by SOHAIL ▴ 400

0

A buffer like samjdk.buffer_size is the amount of memory that the htsjdk library uses for storing short reads in memory. For example, when writing, the reads are first stored in a memory buffer; when this buffer is full, the reads are written to disk. The larger it is, the faster your application runs (less I/O), but the more memory (heap) you need.

7.7 years ago by Pierre Lindenbaum 161k
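To make the buffer trade-off concrete, here is a minimal sketch using only the standard Java library, not htsjdk itself. The class name BufferSizeSketch, the CountingSink helper and the fake 51-byte "read" are all made up for illustration, and reading the value with Integer.getInteger is an assumption about how a -Dsamjdk.buffer_size=... setting could be picked up, not htsjdk's actual code. The byte array behind a BufferedOutputStream is an ordinary Java object, so in this sketch the buffer does come out of the heap limited by -Xmx.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BufferSizeSketch {
    /** Hypothetical stand-in for a disk: counts how many write calls reach it. */
    static final class CountingSink extends OutputStream {
        int writeCalls = 0;
        @Override public void write(int b) { writeCalls++; }
        @Override public void write(byte[] b, int off, int len) { writeCalls++; }
    }

    /** Writes nReads fake reads through a buffer and returns the number of "disk" writes. */
    static int writeReads(int bufferSize, int nReads) {
        CountingSink sink = new CountingSink();
        byte[] read = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC\n"
                .getBytes(StandardCharsets.US_ASCII);
        try (BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            for (int i = 0; i < nReads; i++) {
                out.write(read); // stays in memory until the buffer is full
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.writeCalls;
    }

    public static void main(String[] args) {
        // Picks up -Dsamjdk.buffer_size=... if given, else falls back to 128k (assumption).
        int fromProperty = Integer.getInteger("samjdk.buffer_size", 131072);
        System.out.println("small buffer (1024 bytes): " + writeReads(1024, 10000) + " writes");
        System.out.println("buffer of " + fromProperty + " bytes: "
                + writeReads(fromProperty, 10000) + " writes");
    }
}
```

Running it with and without `-Dsamjdk.buffer_size=...` on the java command line shows fewer writes as the buffer grows, at the cost of a larger array held in memory.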

score 5 · Answer 1 · 2016-07-29

The samjdk.* properties are loaded in https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/Defaults.java (doc in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html )

buffer_size : Buffer size, in bytes, used whenever reading/writing files or streams. Default = 128k.
compression_level : Compression level to be used for writing BAM and other block-compressed outputs. Default = 5.
use_async_io : seems to be deprecated and replaced with use_async_io_read_samtools, etc.
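As a rough illustration of what a compression level trades off, the sketch below compresses the same bytes with java.util.zip.Deflater at several levels. This is plain JDK code, not htsjdk's BAM writer; the class name CompressionLevelSketch and the randomly generated bases are assumptions made up for the example, and real BAM data will behave differently.

```java
import java.util.Random;
import java.util.zip.Deflater;

class CompressionLevelSketch {
    /** Compresses input at the given level (1 = fastest, 9 = smallest) and returns the output size. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[64 * 1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk); // only the byte count matters here
        }
        deflater.end();
        return total;
    }

    /** Made-up test data: n pseudo-random bases, fixed seed so runs are repeatable. */
    static byte[] fakeBases(int n) {
        Random rnd = new Random(42);
        byte[] bases = new byte[n];
        for (int i = 0; i < n; i++) {
            bases[i] = (byte) "ACGT".charAt(rnd.nextInt(4));
        }
        return bases;
    }

    public static void main(String[] args) {
        byte[] data = fakeBases(5_000_000);
        for (int level : new int[] {1, 5, 9}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("level " + level + ": " + size + " bytes in " + millis + " ms");
        }
    }
}
```

A low level such as 1 generally spends less CPU time per byte and produces a larger file, which is a common choice for intermediate files in a pipeline that will be deleted or recompressed later.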

GC stands for "garbage collector": a special Java thread that releases the memory of all objects that are no longer used, e.g.:

Integer a = new Integer("1234");
a = new Integer("5658"); // GC should release the memory of the previous object

googling for the other parameters:

GCTimeLimit: "The upper limit on the amount of time spent in garbage collection in percent of total time (default is 98)."
GCHeapFreeLimit: "The lower limit on the amount of space freed during a garbage collection in percent of the maximum heap (default is 2)."
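To see how close a run is to these limits, you can ask the JVM how much time it has spent in GC so far. The sketch below uses the standard java.lang.management API; the class name GcOverheadSketch is made up, and the percentage it prints is a simple whole-run average, which is an assumption for illustration and not the exact figure the JVM compares against GCTimeLimit.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverheadSketch {
    /** Total milliseconds spent in all garbage collectors since the JVM started. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time as a percentage of JVM uptime (whole-run average). */
    static double gcTimePercent() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : 100.0 * totalGcMillis() / uptime;
    }

    public static void main(String[] args) {
        // maxMemory() reflects the -Xmx setting (or the JVM default when -Xmx is absent).
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap: " + maxMb + " MB");

        // Create short-lived garbage so the collector has something to do.
        long checksum = 0;
        for (int i = 0; i < 2000; i++) {
            byte[] block = new byte[1024 * 1024];
            block[0] = (byte) i;
            checksum += block[0];
        }
        System.out.println("checksum: " + checksum);
        System.out.printf("time spent in GC: %.2f%% of uptime%n", gcTimePercent());
    }
}
```

Running it with a small heap (for example -Xmx128m) and then a larger one gives a feel for how the heap size affects the share of time spent collecting.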

UseStringCache: "Enables caching of commonly allocated strings."

e.g:

// intent of a string cache: reuse one copy of "chr1" instead of allocating it twice
String s1=new String("chr1");
String s2=new String("chr1");
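One caveat on the two new String("chr1") lines above: in plain Java, each new String(...) creates a separate object. The sketch below shows that, and then uses String.intern(), a standard but different mechanism, to show what sharing a single copy looks like. It is only an illustration of the idea of reusing equal strings, not a description of what -XX:+UseStringCache does inside the JVM; the class name StringCopiesSketch is made up.

```java
class StringCopiesSketch {
    /** True only when both variables point at the very same object in memory. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = new String("chr1");
        String s2 = new String("chr1");

        // Same characters, but two distinct objects on the heap.
        System.out.println("equal content: " + s1.equals(s2));
        System.out.println("same object:   " + sameObject(s1, s2));

        // intern() returns the one shared copy kept in the JVM's string pool.
        System.out.println("same after intern(): " + sameObject(s1.intern(), s2.intern()));
    }
}
```

For NGS data this matters because names such as chromosome identifiers repeat across millions of records, so sharing one copy of each can save a lot of heap.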