Question: Select sequences from fastq.gz file
gravatar for msameet
5.4 years ago by
United States
msameet30 wrote:

Is there a way to sample sequences from the `fastq.gz` file?  I have files that contain hundreds of millions of reads, I want to sample about 100 million reads from each file.  Is there a way to do this?

ADD COMMENTlink modified 5.4 years ago by Peter5.8k • written 5.4 years ago by msameet30

duplicate of

Selecting Random Pairs From Fastq?

How To Randomly Select 20M Reads From A 200M Fastq File


ADD REPLYlink written 5.4 years ago by Pierre Lindenbaum127k

No, I dont want to extract the files.  I saw the question, it assumes that the files are "extracted", no longer a *.gz.

ADD REPLYlink written 5.4 years ago by msameet30
gravatar for Brian Bushnell
5.4 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

BBTools's reformat tool does this, keeping pairs together if they are interleaved: in=reads.fq.gz out=sampled.fq.gz samplereadstarget=100000000

or for reads in 2 files: in1=reads1.fq.gz in2=reads2.fq.gz out=sampled1.fq.gz out2=sample2.fq.gz samplereadstarget=100000000

You can also just sample a fraction of the reads without respect to the final number, which is faster since it only needs to read the file once: in=reads.fq.gz out=sampled.fq.gz samplerate=0.1

or just sample the first X reads, which is faster still: in=reads.fq.gz out=sampled.fq.gz reads=100000000

Reformat is multithreaded and extremely fast.

ADD COMMENTlink written 5.4 years ago by Brian Bushnell17k

This gives the following error

sh ~/downloads/bbmap/ in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000

java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000

Exception in thread "main" java.lang.ClassFormatError: jgi.ReformatReads (unrecognized class file version)

   at java.lang.VMClassLoader.defineClass(

   at java.lang.ClassLoader.defineClass(



   at java.lang.ClassLoader.loadClass(

   at java.lang.ClassLoader.loadClass(


Is there a way around this?

My java version is 

java -version

java version "1.7.0_03"

Java(TM) SE Runtime Environment (build 1.7.0_03-b04)

Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)

ADD REPLYlink written 5.4 years ago by msameet30

That's very strange; it should work with any Java 7 runtime.  It seems like a different java is is being called when running the program than when you run java -version, or that Java is configured incorrectly so that you have the executable for Java 7 but you are not using the standard libraries, since it is loading libgcj (from gnu) rather than the Oracle libraries.

You can try the latest Java 6 build, here, which should work, or else you can install Oracle's Java 7 JRE.

It's also possible that manually pathing to the location of Oracle's java executable would work:

/path/to/Java7/jre/bin/java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000


Edit: Update - BBTools works best with Java 8+ now. It is fully compatible with Java 7 and mostly compatible with Java 6. But if your default Java version is 6 (or 7), I suggest you encourage your sysadmins to upgrade to a modern version of Java.

For reference - Java is fully forward and backward compatible, which is one of the things I love about it. However, you cannot use Java libraries written after your last download of Java. So, if I write a program that uses parallel sort (like Clumpify), which is supported in Java 8 but not Java 7... if you run Java 8 it will be faster, and in Java 7 it will be slower, since I wrote custom code to handle that scenario. But for BBNorm, which uses a Java library class specific to Java 7+, you can't even run it in Java 6 since it uses a class that did not exist then.

ADD REPLYlink modified 2.8 years ago • written 5.4 years ago by Brian Bushnell17k

Yes, that worked. Thanks.

ADD REPLYlink written 5.4 years ago by msameet30

Hi, do you have any estimates on time taken?

ADD REPLYlink written 5.3 years ago by msameet30

The time depends on the speed of the disk subsystem and whether you have pigz installed, which accelerates compression and decompression.  But for randomly sampling an exact number of reads, Reformat takes the amount of time needed to read the file twice; for sampling X fraction of the reads, it reads the file once; and for just sampling the first X reads, the amount of time depends on X.

If you have a 10 GB raw file, it could be processed in about 20 seconds (if your disk system is fast enough).  A 10 GB raw file that has been gzipped would take more like 125 seconds, assuming decompression at around 80MB/s.


ADD REPLYlink written 5.3 years ago by Brian Bushnell17k
gravatar for Vivek
5.4 years ago by
Vivek2.4k wrote:

seqtk does this

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

ADD COMMENTlink written 5.4 years ago by Vivek2.4k
gravatar for Peter
5.4 years ago by
Scotland, UK
Peter5.8k wrote:

If you want a Galaxy solution, /

This offers non-random sampling (for reproducibility), e.g. take 12% of the sequences, or every 100th sequence.

ADD COMMENTlink written 5.4 years ago by Peter5.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1630 users visited in the last hour