Optimizing Java Jar's parameters while processing large high throughput sequencing data
7.7 years ago
SOHAIL ▴ 400

Hi everybody,

Like other NGS data analysts, I am a new user of Java applications designed for NGS data analysis. I have seen various JVM tuning parameters in command lines used to process large data sets. For example, I recently came across these two commands:

java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ... .

Parameters: -Dsamjdk.buffer_size -XX:GCTimeLimit -XX:GCHeapFreeLimit -Xmx128m -XX:+UseStringCache -Dsamjdk.use_async_io=true

I tried to learn about them on the internet, but it is still not clear to me how to set these parameters (listed above) for large data sets when piping several tools together. As I have a biological background, could anyone please explain how to use them, with a one-line definition and purpose for each?

Thank you very much!

ngs java • 3.2k views

Hi Pierre,

Thank you for explaining each aspect so comprehensively. :) However, I have one point of confusion: is the memory for buffers in Java taken from the memory allocated to the heap, or is it assigned independently by the JVM?


A buffer setting like samjdk.buffer_size is the amount of memory that the htsjdk library uses to hold short reads in memory. For example, when writing, the reads are stored in a memory buffer; when this buffer is full, the reads are written to disk. The larger the buffer, the faster your application runs (less I/O), but the more memory (heap) it needs.
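To make the buffer trade-off concrete, here is a minimal sketch using only the standard java.io classes. It is not htsjdk code: the BufferDemo class, the CountingStream helper and the fake read are all hypothetical, and an in-memory stream stands in for the disk. It counts how often the buffer hands data to the underlying stream for a small and a large buffer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BufferDemo {

    /** Stand-in for a disk file: counts how many block writes it receives. */
    static class CountingStream extends ByteArrayOutputStream {
        int writes = 0;

        @Override
        public synchronized void write(byte[] b, int off, int len) {
            writes++;
            super.write(b, off, len);
        }
    }

    /** Writes `records` fake reads through a buffer of `bufferSize` bytes
     *  and returns the number of writes that reached the underlying stream. */
    static int countWrites(int bufferSize, int records) {
        CountingStream disk = new CountingStream();
        byte[] read = "ACGTACGTACGTACGT\n".getBytes(StandardCharsets.US_ASCII);
        try (OutputStream out = new BufferedOutputStream(disk, bufferSize)) {
            for (int i = 0; i < records; i++) {
                out.write(read); // stays in the buffer until the buffer is full
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return disk.writes;
    }

    public static void main(String[] args) {
        System.out.println("64-byte buffer:  " + countWrites(64, 1000) + " writes");
        System.out.println("128 KiB buffer:  " + countWrites(131072, 1000) + " writes");
    }
}
```

The larger buffer reaches the underlying stream far less often, at the cost of holding more bytes on the heap at once.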

7.7 years ago

The samjdk.* properties are loaded in https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/Defaults.java (doc in https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/Defaults.html )

  • buffer_size: Buffer size, in bytes, used whenever reading/writing files or streams. Default = 128k.
  • compression_level: Compression level to be used for writing BAM and other block-compressed outputs. Default = 5.
  • use_async_io seems to be deprecated and replaced with use_async_io_read_samtools, etc.
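For readers new to Java: anything passed as -Dname=value on the command line becomes a system property that the program can read at startup. The sketch below shows the general mechanism with standard library calls only. It is an illustration, not the actual code in Defaults.java; the SamjdkProps class and its method names are hypothetical, and the fallback values are the defaults listed above.

```java
public class SamjdkProps {

    /** Value of -Dsamjdk.buffer_size, or 128k (131072 bytes) when the flag is absent. */
    static int bufferSize() {
        return Integer.getInteger("samjdk.buffer_size", 128 * 1024);
    }

    /** Value of -Dsamjdk.compression_level, or 5 when the flag is absent. */
    static int compressionLevel() {
        return Integer.getInteger("samjdk.compression_level", 5);
    }

    public static void main(String[] args) {
        // Try: java -Dsamjdk.compression_level=1 SamjdkProps
        System.out.println("buffer_size       = " + bufferSize());
        System.out.println("compression_level = " + compressionLevel());
    }
}
```

This is why the -D options must come before -jar: they are settings for the JVM, not arguments to the Picard tool.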

GC stands for "garbage collector": a special Java thread that releases the memory of all objects that are no longer used, e.g.:

Integer a = new Integer("1234");
a = new Integer("5658"); // GC should release the memory of the previous object

googling for the other parameters:

GCTimeLimit: "The upper limit on the amount of time spent in garbage collection in percent of total time (default is 98)."

GCHeapFreeLimit: "The lower limit on the amount of space freed during a garbage collection in percent of the maximum heap (default is 2)."
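Both limits are percentages, and GCHeapFreeLimit is measured against the maximum heap, which is what -Xmx sets (-Xmx128m means at most 128 MiB, -Xmx5000m at most 5000 MiB). A small sketch, using only java.lang.Runtime, to check what heap a given -Xmx actually gives you; the HeapReport class and its method names are hypothetical:

```java
public class HeapReport {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will try to use, in MiB (controlled by -Xmx). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied by objects (live or not yet collected), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        // Try: java -Xmx128m HeapReport   and   java -Xmx5000m HeapReport
        System.out.println("max heap  (MiB): " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

The reported maximum can be slightly below the -Xmx value, depending on the JVM and the garbage collector in use.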

UseStringCache: "Enables caching of commonly allocated strings."

e.g.:

// the same text "chr1" is requested twice; a string cache aims to store it only once:
String s1=new String("chr1");
String s2=new String("chr1");
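In plain Java, each new String(...) creates a separate object even when the text is identical. The sketch below shows this, and how identical strings can share a single object through String.intern(). Note that intern() is a different, standard-library mechanism from the -XX:+UseStringCache flag; it is used here only to illustrate the idea of storing a repeated string such as a chromosome name once. The StringSharing class is hypothetical.

```java
public class StringSharing {

    /** True when both references point to the very same object in memory. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = new String("chr1");
        String s2 = new String("chr1");

        // Two separate objects holding the same text.
        System.out.println("same text:             " + s1.equals(s2));
        System.out.println("same object:           " + sameObject(s1, s2));

        // intern() returns one canonical copy for each distinct text.
        System.out.println("same object, interned: " + sameObject(s1.intern(), s2.intern()));
    }
}
```

With millions of reads that all carry the same few reference names, sharing one copy per name instead of one per read is where the memory saving comes from.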