Question: Select sequences from fastq.gz file
1
gravatar for msameet
4.4 years ago by
msameet30
United States
msameet30 wrote:

Is there a way to sample sequences from the `fastq.gz` file?  I have files that contain hundreds of millions of reads, I want to sample about 100 million reads from each file.  Is there a way to do this?

ADD COMMENTlink modified 4.4 years ago by Peter5.8k • written 4.4 years ago by msameet30

duplicate of

Selecting Random Pairs From Fastq?

How To Randomly Select 20M Reads From A 200M Fastq File

 

ADD REPLYlink written 4.4 years ago by Pierre Lindenbaum119k

No, I dont want to extract the files.  I saw the question, it assumes that the files are "extracted", no longer a *.gz.

ADD REPLYlink written 4.4 years ago by msameet30
5
gravatar for Brian Bushnell
4.4 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBTools's reformat tool does this, keeping pairs together if they are interleaved:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplereadstarget=100000000

or for reads in 2 files:

reformat.sh in1=reads1.fq.gz in2=reads2.fq.gz out=sampled1.fq.gz out2=sample2.fq.gz samplereadstarget=100000000

You can also just sample a fraction of the reads without respect to the final number, which is faster since it only needs to read the file once:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.1

or just sample the first X reads, which is faster still:

reformat.sh in=reads.fq.gz out=sampled.fq.gz reads=100000000

Reformat is multithreaded and extremely fast.

ADD COMMENTlink written 4.4 years ago by Brian Bushnell16k

This gives the following error

sh ~/downloads/bbmap/reformat.sh in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000

java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000

Exception in thread "main" java.lang.ClassFormatError: jgi.ReformatReads (unrecognized class file version)

   at java.lang.VMClassLoader.defineClass(libgcj.so.10)

   at java.lang.ClassLoader.defineClass(libgcj.so.10)

   at java.security.SecureClassLoader.defineClass(libgcj.so.10)

   at java.net.URLClassLoader.findClass(libgcj.so.10)

   at java.lang.ClassLoader.loadClass(libgcj.so.10)

   at java.lang.ClassLoader.loadClass(libgcj.so.10)

   at gnu.java.lang.MainThread.run(libgcj.so.10)

Is there a way around this?

My java version is 

java -version

java version "1.7.0_03"

Java(TM) SE Runtime Environment (build 1.7.0_03-b04)

Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)

ADD REPLYlink written 4.4 years ago by msameet30

That's very strange; it should work with any Java 7 runtime.  It seems like a different java is is being called when running the program than when you run java -version, or that Java is configured incorrectly so that you have the executable for Java 7 but you are not using the standard libraries, since it is loading libgcj (from gnu) rather than the Oracle libraries.

You can try the latest Java 6 build, here, which should work, or else you can install Oracle's Java 7 JRE.

It's also possible that manually pathing to the location of Oracle's java executable would work:

/path/to/Java7/jre/bin/java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000

 

Edit: Update - BBTools works best with Java 8+ now. It is fully compatible with Java 7 and mostly compatible with Java 6. But if your default Java version is 6 (or 7), I suggest you encourage your sysadmins to upgrade to a modern version of Java.

For reference - Java is fully forward and backward compatible, which is one of the things I love about it. However, you cannot use Java libraries written after your last download of Java. So, if I write a program that uses parallel sort (like Clumpify), which is supported in Java 8 but not Java 7... if you run Java 8 it will be faster, and in Java 7 it will be slower, since I wrote custom code to handle that scenario. But for BBNorm, which uses a Java library class specific to Java 7+, you can't even run it in Java 6 since it uses a class that did not exist then.

ADD REPLYlink modified 21 months ago • written 4.4 years ago by Brian Bushnell16k
1

Yes, that worked. Thanks.

ADD REPLYlink written 4.4 years ago by msameet30

Hi, do you have any estimates on time taken?

ADD REPLYlink written 4.4 years ago by msameet30

The time depends on the speed of the disk subsystem and whether you have pigz installed, which accelerates compression and decompression.  But for randomly sampling an exact number of reads, Reformat takes the amount of time needed to read the file twice; for sampling X fraction of the reads, it reads the file once; and for just sampling the first X reads, the amount of time depends on X.

If you have a 10 GB raw file, it could be processed in about 20 seconds (if your disk system is fast enough).  A 10 GB raw file that has been gzipped would take more like 125 seconds, assuming decompression at around 80MB/s.

 

ADD REPLYlink written 4.4 years ago by Brian Bushnell16k
2
gravatar for Vivek
4.4 years ago by
Vivek2.2k
Denmark
Vivek2.2k wrote:

seqtk does this

https://github.com/lh3/seqtk

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

ADD COMMENTlink written 4.4 years ago by Vivek2.2k
0
gravatar for Peter
4.4 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

If you want a Galaxy solution, http://toolshed.g2.bx.psu.edu/view/peterjc/sample_seqs / https://github.com/peterjc/pico_galaxy/tree/master/tools/sample_seqs

This offers non-random sampling (for reproducibility), e.g. take 12% of the sequences, or every 100th sequence.

ADD COMMENTlink written 4.4 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 641 users visited in the last hour