Select sequences from fastq.gz file
3
1
Entering edit mode
9.4 years ago
msameet ▴ 50

Is there a way to sample sequences from the fastq.gz file? I have files that contain hundreds of millions of reads, I want to sample about 100 million reads from each file. Is there a way to do this?

fastq random-selection • 10k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

No, I dont want to extract the files. I saw the question, it assumes that the files are "extracted", no longer a *.gz.

ADD REPLY
6
Entering edit mode
9.4 years ago

BBTools's reformat tool does this, keeping pairs together if they are interleaved:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplereadstarget=100000000

or for reads in 2 files:

reformat.sh in1=reads1.fq.gz in2=reads2.fq.gz out=sampled1.fq.gz out2=sample2.fq.gz samplereadstarget=100000000

You can also just sample a fraction of the reads without respect to the final number, which is faster since it only needs to read the file once:

reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.1

or just sample the first X reads, which is faster still:

reformat.sh in=reads.fq.gz out=sampled.fq.gz reads=100000000

Reformat is multithreaded and extremely fast.

ADD COMMENT
0
Entering edit mode

This gives the following error

sh ~/downloads/bbmap/reformat.sh in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000
java -ea -Xmx200m -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz out=sampled_S11-12810Ca_V2.fastq.gz reads=72000000
Exception in thread "main" java.lang.ClassFormatError: jgi.ReformatReads (unrecognized class file version)
   at java.lang.VMClassLoader.defineClass(libgcj.so.10)
   at java.lang.ClassLoader.defineClass(libgcj.so.10)
   at java.security.SecureClassLoader.defineClass(libgcj.so.10)
   at java.net.URLClassLoader.findClass(libgcj.so.10)
   at java.lang.ClassLoader.loadClass(libgcj.so.10)
   at java.lang.ClassLoader.loadClass(libgcj.so.10)
   at gnu.java.lang.MainThread.run(libgcj.so.10)

Is there a way around this?

My java version is

java -version
java version "1.7.0_03"
Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
ADD REPLY
0
Entering edit mode

That's very strange; it should work with any Java 7 runtime. It seems like a different java is is being called when running the program than when you run java -version, or that Java is configured incorrectly so that you have the executable for Java 7 but you are not using the standard libraries, since it is loading libgcj (from gnu) rather than the Oracle libraries.

You can try the latest Java 6 build, here, which should work, or else you can install Oracle's Java 7 JRE.

It's also possible that manually pathing to the location of Oracle's java executable would work:

/path/to/Java7/jre/bin/java \
  -ea \
  -Xmx200m \
  -cp /ycga-ba/home/sm2556/downloads/bbmap/current/ jgi.ReformatReads \
  in=../S11-12810Ca_005_ACAGTG_L001_R1_V2.fastq.gz \
  out=sampled_S11-12810Ca_V2.fastq.gz \
  reads=72000000

Edit: Update - BBTools works best with Java 8+ now. It is fully compatible with Java 7 and mostly compatible with Java 6. But if your default Java version is 6 (or 7), I suggest you encourage your sysadmins to upgrade to a modern version of Java.

For reference - Java is fully forward and backward compatible, which is one of the things I love about it. However, you cannot use Java libraries written after your last download of Java. So, if I write a program that uses parallel sort (like Clumpify), which is supported in Java 8 but not Java 7... if you run Java 8 it will be faster, and in Java 7 it will be slower, since I wrote custom code to handle that scenario. But for BBNorm, which uses a Java library class specific to Java 7+, you can't even run it in Java 6 since it uses a class that did not exist then.

ADD REPLY
1
Entering edit mode

Yes, that worked. Thanks.

ADD REPLY
0
Entering edit mode

Hi, do you have any estimates on time taken?

ADD REPLY
0
Entering edit mode

The time depends on the speed of the disk subsystem and whether you have pigz installed, which accelerates compression and decompression. But for randomly sampling an exact number of reads, Reformat takes the amount of time needed to read the file twice; for sampling X fraction of the reads, it reads the file once; and for just sampling the first X reads, the amount of time depends on X.

If you have a 10 GB raw file, it could be processed in about 20 seconds (if your disk system is fast enough). A 10 GB raw file that has been gzipped would take more like 125 seconds, assuming decompression at around 80MB/s.

ADD REPLY
3
Entering edit mode
9.4 years ago
Vivek ★ 2.7k

seqtk does this

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
ADD COMMENT
0
Entering edit mode
9.4 years ago
Peter 6.0k

If you want a Galaxy solution, http://toolshed.g2.bx.psu.edu/view/peterjc/sample_seqs / https://github.com/peterjc/pico_galaxy/tree/master/tools/sample_seqs

This offers non-random sampling (for reproducibility), e.g. take 12% of the sequences, or every 100th sequence.

ADD COMMENT

Login before adding your answer.

Traffic: 2700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6