Question: Downsampling dataset with more than 60 million reads
gravatar for mike
6.3 years ago by
mike70 wrote:


Which tool can perform downsampling for RNAseq dataset with more than 60 million paired reads in fastq? I have used seqtk but it has issues with large memory consumptions.


rna-seq next-gen • 14k views
ADD COMMENTlink modified 2.5 years ago by vmicrobio260 • written 6.3 years ago by mike70
gravatar for lh3
6.3 years ago by
United States
lh332k wrote:

The seqtk <> downsampling.

1. Downsample a fraction of reads (useful when you know the total number):

seqtk sample read1.fa.gz 0.2 > sub1.fa
seqtk sample read2.fa.gz 0.2 > sub2.fa

It takes little memory.

2. Downsample a fixed number (streaming mode):

seqtk sample read1.fa 20000 > sub1.fa
seqtk sample read2.fa 20000 > sub2.fa

It uses the reservoir sampling on a stream. The peak RAM equals the size of output, independent of the input.

3. Downsample a fixed number (2-pass mode):

seqtk sample -2 read1.fa.gz 20000 > sub1.fa
seqtk sample -2 read2.fa.gz 20000 > sub2.fa

It reads the input twice (so twice as slow). In the first pass, it finds the sampled read indices. In the second pass, it outputs reads at the stored indices. The peak RAM is about the number of sampled reads multiplied by 24, again independent of the input. You need the latest seqtk for this 2-pass mode.

ADD COMMENTlink written 6.3 years ago by lh332k
gravatar for Brian Bushnell
4.4 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote: from the BBMap package can also do exact subsampling using a small, fixed amount of memory. The normal mode reads the file once and samples at a fixed rate (e.g. samplerate=0.1 will output about 10% of the reads). Exact mode can be used with "samplereadstarget" or "samplebasestarget". Each will read the file twice, and output exactly that number of reads or bases randomly sampled from throughout the file (obviously, the number of bases has granularity of the read length; it's not going to output half a read). It does not require reads be uniform length, and can handle gzipped input, interleaved or paired fastqs, multiline fasta, etc.

To sample 10% of the reads: in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely: in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling: in=reads#.fq out=sampled#.fq samplereadstarget=100k
ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Brian Bushnell17k
gravatar for Irsan
6.3 years ago by
Irsan7.2k wrote:

put this in a file called


# Usage: python forward.fastq reverse.fastq 20000

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa),  open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = - 1
    for rr in rand_records:

        while rec_no < rr:
            rec_no += 1       
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
        rec_no += 1 # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" % (,

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)

And use it on unzipped fastq-files like this to get 1M reads:

[your prompt]$ python forward.fastq reverse.fastq 1000000
ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Irsan7.2k

this will produce a sample with replacement, meaning that the same record may be selected more than once. There are two simpler solutions.

One is to decide the probability of printing a record and iterate with that, this will not ensure exact number of records but may be just fine for downsampling and would produce results of the same type as the above. The code would be simpler  and more efficient as it would not need to sort the array.

If the exact number of records is desired then a better solution would be to shuffle the record numbers, slice the top N then sort them and proceed with the rest of the code above.

See also the post: Selecting Random Pairs From Fastq?

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Istvan Albert ♦♦ 85k

Thanks for the code Irsan. Will I get downsampled data in fastq format?

ADD REPLYlink written 6.3 years ago by mike70

One little problem with this code - you never close fqa and fqb files. Either use: with open(fqa) as suba: ... or call suba.close() at the end of write_random_records.

Also this implementation allows for picking single record multiple times. That means that if you'll try to sub-sample to 50% size, it is almost certain that some fastq reads will be duplicated.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by martin.triska130
gravatar for Alex Reynolds
6.3 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

I wrote a tool in C called sample that was designed to get past memory limitations in GNU shuf, which can choke on shuffling inputs of the scale dealt with in genomic studies. I haven't compared it with seqtk, but perhaps sample might be of use to you for FASTQ and other large-scale genomic datasets.

The sample program stores an eight-byte integer for every location of a per-line or per-multiple-line element in a text-formatted input file. These locations are shuffled and a uniformly-random sample is taken of the desired sample size.

For a, say, 4 GiB computer that can allocate 3.5 GiB of memory to sample, up to 470M four-line FASTQ records can be sampled without replacement (235M, with replacement):

$ sample --lines-per-offset=4 --sample-size=${N} reads.fastq > sample.fastq

If the FASTQ records are paired, paired samples can be derived by making a FASTQ file where one record is followed immediately by its pair, and then using sample with --lines-per-offset=8.

The increased lines-per-offset parameter allows sampling (without replacement) a paired input file containing up to 940M paired records (470M, with replacement) on a 4 GiB system allocating 3.5 GiB to sample:

$ sample --lines-per-offset=8 --sample-size=${N} paired_reads.fastq > paired_sample.fastq

The file paired_reads.fastq can be created by linearizing the two FASTQ files, using paste to interleave them with a newline delimiter, and then delinearizing the resulting sample.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Alex Reynolds31k
gravatar for Fred
6.3 years ago by
Paris, France
Fred750 wrote:

Maybe something like that could help:

use strict;
use warnings;
# Usage ./ <fastq r1> <fastq r2> <outFastq r1> <outFastq r2> <prob of keeping reads>
my $proba = $ARGV[4];
my $line1;
my $line2;
my $nbLines = 1;
my $random;
my $fqRecord1;
my $fqRecord2;
        $random = rand(1);
        if($random <= $proba){
            print FASTQOUTF $fqRecord1;
            print FASTQOUTR $fqRecord2;

It takes the two FastQ files as input, the names of output Files, and a probability of taking each read. The probability may be computed as: number of desired reads / total number of reads.

If you want to apply this script on gzipped files:

./ <(gunzip -c f.fastq.gz) <(gunzip -c r.fastq.gz) >(gzip -c - > f_sample.fastq.gz) >(gzip -c - > r_sample.fastq.gz) 0.5
ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Fred750

Yep nice idea, for 60M reads and 1M sample it would give +/-1000 reads deviation.

ADD REPLYlink written 6.3 years ago by mikhail.shugay3.4k
gravatar for Matt Shirley
4.4 years ago by
Matt Shirley9.4k
Cambridge, MA
Matt Shirley9.4k wrote:

If you want a uniform sample over the entire file, and if all the reads are the same length in both files you can use a tool I've developed that is about as efficient as possible: mdshw5/strandex. It takes the file size, determines offsets within the file to start from, and matches FASTQ entries occurring after those offsets. This way you're reading only slightly more data that you sample, and an exact number of reads can be specified. Memory usage is <1MB and sampling time scales linearly with the number of reads sampled.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Matt Shirley9.4k
gravatar for vmicrobio
2.5 years ago by
vmicrobio260 wrote:

you may try something in awk :

cat file.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) {printf("\n");} else { printf("\t");} }' | 
awk -v k=8000 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | 
awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "downsamp-file.fastq"}'
ADD COMMENTlink written 2.5 years ago by vmicrobio260
gravatar for dariober
6.3 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

Picard DownsampleSam?

ADD COMMENTlink modified 13 months ago by _r_am31k • written 6.3 years ago by dariober11k

thanks but I want to downsample data at fastq level

ADD REPLYlink written 6.3 years ago by mike70

My apologies, I haven't read the question carefully enough.

ADD REPLYlink written 6.3 years ago by dariober11k
gravatar for SES
6.3 years ago by
Vancouver, BC
SES8.4k wrote:

If you specify the fraction of reads to sample with seqtk instead of the number, then only one read will be kept in memory at a time. For example,

seqtk sample -s100 read1.fq 0.1 > sub1.fq

That is not clear from the documentation, but the author does indicate this in another biostar thread ( A: Selecting Random Pairs From Fastq? ).

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by SES8.4k

I am down sampling data based on lowest number of reads among paired fastq files. I don't think I can use fraction in this case.

ADD REPLYlink written 6.3 years ago by mike70

I'm not sure I understand what you mean. In your question you stated that you have 60 million paired sequences in fastq, and if you have a method for determining the lowest number you need, then just convert that to a fraction.

ADD REPLYlink written 6.3 years ago by SES8.4k

Just added a 2-pass mode to seqtk to trade speed for smaller peak memory.

ADD REPLYlink written 6.3 years ago by lh332k

Yes, for 60 million, fraction is preferred. 60 million 100bp reads would require at least 60M*100*2=12GB memory, plus the memory taken by the read names. There are ways to significantly reduce the memory with two-pass file reading.

ADD REPLYlink written 6.3 years ago by lh332k
gravatar for mikhail.shugay
6.3 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

My 5 cents. 

First usually FASTQ is assumed unsorted so you can take just the first X reads.

Then the general solution (Java) is to create an array of 1:60 000 000 and then use Collections.shuffle, take first X numbers and check on each 4 lines of FASTQ check if those are in those first X shuffled numbers. Here is the example using Groovy, it is not using more than 4GB RAM for it and takes several seconds on a laptop to generate the index (lineNumbersFilter). Main limiting factor is I/O speed as hash search is quite fast.

int nLines = 60000000, sampleSize = 10000000

def fastq1 = "R1.fastq", fastq2 = "R2.fastq", out1 = "R1_ds.fastq", out2 = "R2_ds.fastq"

def lineNumbers = new ArrayList<Integer>(0..<nLines)


def lineNumbersFilter = new HashSet<Integer>(lineNumbers[0..<sampleSize])

int n = 0

new File(out1).withPrintWriter { pw1 ->

    new File(out2).withPrintWriter { pw2 ->

        new File(fastq1).withReader { reader1 ->

            new File(fastq2).withReader { reader2 ->

                def read1, read2

                while (((read1 = reader1.readLine()) != null) &&
                        ((read2 = reader2.readLine()) != null)) {

                    read1 += "\n" + reader1.readLine() +
                            "\n" + reader1.readLine() +
                            "\n" + reader1.readLine()

                    read2 += "\n" + reader2.readLine() +
                            "\n " + reader2.readLine() +
                            "\n" + reader2.readLine()

                    if (lineNumbersFilter.contains(n)) {

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by mikhail.shugay3.4k

I *think* reads in a fastq from Illumina sequencers are in the order as they appear on the lane of the flow cell. So they are randomized in terms of genomic position but technically they might be biased, which is the reason why the first reads are usually bad...

ADD REPLYlink written 6.3 years ago by dariober11k

Sure it won't protect from flow cell artifacts, however it usually works fine and only first several reads are of very bad quality.

ADD REPLYlink written 6.3 years ago by mikhail.shugay3.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2145 users visited in the last hour