Question: Optimizing Java Jar's parameters while processing large high throughput sequencing data
2.6 years ago by
SOHAIL240
Beijing Institute of Genomics, CAS.
SOHAIL240 wrote:

Hi everybody,

Like many NGS data analysts who are new to the field, I am using Java applications designed for NGS data analysis. I have noticed various Java tuning parameters on the command line when large data sets are processed. For example, I recently came across these two commands:

java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

Parameters: -Dsamjdk.buffer_size -XX:GCTimeLimit -XX:GCHeapFreeLimit -Xmx128m -XX:+UseStringCache -Dsamjdk.use_async_io=true

I tried to learn about them on the internet, but it is still not clear to me how to set these parameters (listed above) for large data sets when piping several tools together. As I have a biological background, could anyone please give a one-line definition and purpose for each, and explain in a bit more detail how to use them?

Thank you very much!

java ngs • 1.3k views
modified 7 months ago by Biostar • written 2.6 years ago by SOHAIL240

Hi Pierre,

Thank you for explaining each aspect so thoroughly. :) However, I am still confused about one thing: is the memory for buffers in Java taken from the heap's allocated memory, or is it assigned independently by the JVM?

written 2.6 years ago by SOHAIL240

A buffer such as samjdk.buffer_size sets the amount of memory that the htsjdk library uses to hold short reads in memory. For example, when writing, the reads are first stored in a memory buffer; when this buffer is full, the reads are written to disk. The larger the buffer, the faster your application runs (fewer I/O operations), but the more memory (heap) it needs.
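To make the buffer/heap trade-off concrete, here is a minimal sketch that uses only the standard JDK, not htsjdk itself. The class name BufferSketch, the CountingStream helper and the record sizes are made up for illustration. The buffer inside BufferedOutputStream is an ordinary byte array, so it lives on the heap; a larger buffer means fewer writes reach the underlying stream.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: plain JDK buffering, not the htsjdk implementation.
class BufferSketch {

    /** Stands in for the disk and counts how often it is actually written to. */
    static final class CountingStream extends OutputStream {
        int writeCalls = 0;

        @Override
        public void write(int b) {
            writeCalls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writeCalls++;
        }
    }

    /**
     * Pushes recordCount records of recordSize bytes through a buffer of
     * bufferSize bytes and returns how many writes reached the "disk".
     */
    static int underlyingWrites(int bufferSize, int recordCount, int recordSize) {
        CountingStream disk = new CountingStream();
        byte[] record = new byte[recordSize]; // placeholder for one serialized read
        try (OutputStream out = new BufferedOutputStream(disk, bufferSize)) {
            for (int i = 0; i < recordCount; i++) {
                out.write(record); // lands in the heap buffer until it fills up
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return disk.writeCalls;
    }

    public static void main(String[] args) {
        System.out.println("writes with   8 KiB buffer: " + underlyingWrites(8 * 1024, 10_000, 100));
        System.out.println("writes with 128 KiB buffer: " + underlyingWrites(128 * 1024, 10_000, 100));
    }
}
```

The second line should report far fewer writes than the first, at the cost of a larger array on the heap.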

written 2.6 years ago by Pierre Lindenbaum116k
2.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

The samjdk.* properties are loaded in https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/Defaults.java (doc in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html )

  • buffer_size : Buffer size, in bytes, used whenever reading/writing files or streams. Default = 128k.
  • compression_level : Compression level to be used for writing BAM and other block-compressed outputs. Default = 5.
  • use_async_io : seems to be deprecated and replaced by more specific properties such as use_async_io_read_samtools, etc.
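For readers wondering how a -D option ends up inside the program: each -Dname=value on the command line becomes a JVM system property that the code can read at start-up. The sketch below is not the htsjdk source; the class name SamjdkPropertySketch is hypothetical, and the fallback values are simply the defaults quoted in the list above.

```java
// Hypothetical demo class: shows how -D options are read, not how htsjdk does it internally.
class SamjdkPropertySketch {

    /** Value of -Dsamjdk.buffer_size, or 128k when the option is absent. */
    static int bufferSize() {
        return Integer.getInteger("samjdk.buffer_size", 128 * 1024);
    }

    /** Value of -Dsamjdk.compression_level, or 5 when the option is absent. */
    static int compressionLevel() {
        return Integer.getInteger("samjdk.compression_level", 5);
    }

    public static void main(String[] args) {
        System.out.println("buffer_size       = " + bufferSize());
        System.out.println("compression_level = " + compressionLevel());
    }
}
```

Running it as `java -Dsamjdk.buffer_size=262144 SamjdkPropertySketch.java` should print the overridden buffer size and the default compression level.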

GC stands for "garbage collector": a special Java thread that releases the memory of all objects that are no longer used, e.g.:

Integer a = new Integer("1234");
a = new Integer("5658"); // GC should release the memory of the previous object
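
The memory that the GC manages is the heap, and -Xmx (e.g. -Xmx128m or -Xmx5000m in the commands from the question) sets its maximum size. As a rough check, the small sketch below prints what the running JVM reports; the class name HeapSketch is made up, and the reported maximum can be slightly below the -Xmx value depending on the collector.

```java
// Hypothetical demo class: reports the heap limits of the JVM it runs in.
class HeapSketch {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap the JVM will try to use, roughly the -Xmx value. */
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        // totalMemory() is the heap currently reserved, freeMemory() the unused part of it
        System.out.println("used heap (MiB): " + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```

For example, `java -Xmx128m HeapSketch.java` should report a maximum close to 128 MiB.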

googling for the other parameters:

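GCTimeLimit: "The upper limit on the amount of time spent in garbage collection in percent of total time (default is 98)."

GCHeapFreeLimit: "The lower limit on the amount of space freed during a garbage collection in percent of the maximum heap (default is 2)."

Taken together: if the JVM spends more than GCTimeLimit percent of its time in garbage collection while recovering less than GCHeapFreeLimit percent of the heap, it gives up with an OutOfMemoryError ("GC overhead limit exceeded") instead of continuing to run very slowly.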
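
To get a feeling for the "percent of total time" figure, the following sketch asks the running JVM how long its collectors have worked so far, via the standard java.lang.management API. It is only a whole-run average and not the JVM's internal GCTimeLimit check, which looks at recent collections; the class name GcOverheadSketch is made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical demo class: whole-run GC time share, not the JVM's own overhead check.
class GcOverheadSketch {

    /** GC time as a percentage of uptime; 0 when no uptime has been measured yet. */
    static double gcTimePercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    /** Sums the approximate collection time reported by every collector. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("time spent in GC so far: %.2f%% of uptime%n",
                gcTimePercent(totalGcMillis(), uptime));
    }
}
```

A value that stays high for a long-running job is a hint that the heap (-Xmx) is too small for the data.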

UseStringCache: "Enables caching of commonly allocated strings."

e.g:

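// the intent of such a cache: a repeated value like "chr1" only needs to be stored once
// (note that new String(...) on its own always creates a distinct object)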
String s1=new String("chr1");
String s2=new String("chr1");
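
A runnable variant of the same idea, as a sketch only: it uses the standard String.intern() method as a stand-in for sharing, which is not necessarily what the UseStringCache flag does inside the JVM. The class name StringSharingSketch is made up.

```java
// Hypothetical demo class: equal text versus shared object, using String.intern() as a stand-in.
class StringSharingSketch {

    /** True only when both variables point to the very same object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = new String("chr1");
        String s2 = new String("chr1");

        System.out.println(s1.equals(s2));                         // true: same text
        System.out.println(sameObject(s1, s2));                    // false: two separate objects
        System.out.println(sameObject(s1.intern(), s2.intern()));  // true: one shared copy
    }
}
```

With millions of reads that all carry the same few chromosome names, sharing one copy per name saves a noticeable amount of heap.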

written 2.6 years ago by Pierre Lindenbaum116k