1. Downsample a fraction of reads (useful when you know the total number):

Question

Downsampling dataset with more than 60 million reads

6

Entering edit mode

9.7 years ago

mike ▴ 90

Hello,

Which tool can perform downsampling for RNAseq dataset with more than 60 million paired reads in fastq? I have used seqtk but it has issues with large memory consumptions.

Thanks

next-gen RNA-Seq • 32k views

ADD COMMENT • link updated 4 months ago by asalimih ▴ 60 • written 9.7 years ago by mike ▴ 90

0

Entering edit mode

~~Picard DownsampleSam?~~

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 9.7 years ago by dariober 14k

0

Entering edit mode

thanks but I want to downsample data at fastq level

ADD REPLY • link 9.7 years ago by mike ▴ 90

1

Entering edit mode

My apologies, I haven't read the question carefully enough.

ADD REPLY • link 9.7 years ago by dariober 14k

6

Entering edit mode

7.8 years ago

Brian Bushnell 20k

reformat.sh from the BBMap package can also do exact subsampling using a small, fixed amount of memory. The normal mode reads the file once and samples at a fixed rate (e.g. samplerate=0.1 will output about 10% of the reads). Exact mode can be used with "samplereadstarget" or "samplebasestarget". Each will read the file twice, and output exactly that number of reads or bases randomly sampled from throughout the file (obviously, the number of bases has granularity of the read length; it's not going to output half a read). It does not require reads be uniform length, and can handle gzipped input, interleaved or paired fastqs, multiline fasta, etc.

To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

1

Entering edit mode

9.7 years ago

Irsan ★ 7.8k

Put this in a file called sample_N_fastq.py

# Usage: python sample_N_fastq.py forward.fastq reverse.fastq 20000

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa),  open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = - 1
    for rr in rand_records:

        while rec_no < rr:
            rec_no += 1       
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
            suba.write(fha.readline())
            subb.write(fhb.readline())
        rec_no += 1 # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" % (suba.name, subb.name)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)

And use it on unzipped fastq-files like this to get 1M reads:

[your prompt]$ python sample_N_fastq.py forward.fastq reverse.fastq 1000000

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Irsan ★ 7.8k

2

Entering edit mode

This will produce a sample with replacement, meaning that the same record may be selected more than once. There are two simpler solutions.

One is to decide the probability of printing a record and iterate with that, this will not ensure exact number of records but may be just fine for downsampling and would produce results of the same type as the above. The code would be simpler and more efficient as it would not need to sort the array.

If the exact number of records is desired then a better solution would be to shuffle the record numbers, slice the top N then sort them and proceed with the rest of the code above.

See also the post: Selecting Random Pairs From Fastq?

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks for the code Irsan. Will I get downsampled data in fastq format?

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by mike ▴ 90

0

Entering edit mode

One little problem with this code - you never close fqa and fqb files. Either use: with open(fqa) as suba: ... or call suba.close() at the end of write_random_records.

Also this implementation allows for picking single record multiple times. That means that if you'll try to sub-sample to 50% size, it is almost certain that some fastq reads will be duplicated.

ADD REPLY • link 7.2 years ago by martin.triska ▴ 180

1

Entering edit mode

9.7 years ago

Alex Reynolds 35k

I wrote a tool in C called sample that was designed to get past memory limitations in GNU shuf, which can choke on shuffling inputs of the scale dealt with in genomic studies. I haven't compared it with seqtk, but perhaps sample might be of use to you for FASTQ and other large-scale genomic datasets.

The sample program stores an eight-byte integer for every location of a per-line or per-multiple-line element in a text-formatted input file. These locations are shuffled and a uniformly-random sample is taken of the desired sample size.

For a, say, 4 GiB computer that can allocate 3.5 GiB of memory to sample, up to 470M four-line FASTQ records can be sampled without replacement (235M, with replacement):

$ sample --lines-per-offset=4 --sample-size=${N} reads.fastq > sample.fastq

If the FASTQ records are paired, paired samples can be derived by making a FASTQ file where one record is followed immediately by its pair, and then using sample with --lines-per-offset=8.

The increased lines-per-offset parameter allows sampling (without replacement) a paired input file containing up to 940M paired records (470M, with replacement) on a 4 GiB system allocating 3.5 GiB to sample:

$ sample --lines-per-offset=8 --sample-size=${N} paired_reads.fastq > paired_sample.fastq

The file paired_reads.fastq can be created by linearizing the two FASTQ files, using paste to interleave them with a newline delimiter, and then delinearizing the resulting sample.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Alex Reynolds 35k

1

Entering edit mode

9.7 years ago

Fred ▴ 780

Maybe something like that could help:

#!/usr/bin/perl
use strict;
use warnings;
#
# Usage ./sampleFastq.pl <fastq r1> <fastq r2> <outFastq r1> <outFastq r2> <prob of keeping reads>
#
open(FASTQF,$ARGV[0]);
open(FASTQR,$ARGV[1]);
open(FASTQOUTF,">".$ARGV[2]);
open(FASTQOUTR,">".$ARGV[3]);
my $proba = $ARGV[4];
my $line1;
my $line2;
my $nbLines = 1;
my $random;
my $fqRecord1;
my $fqRecord2;
while($line1=<FASTQF>){
    $line2=<FASTQR>;
    $fqRecord1.=$line1;
    $fqRecord2.=$line2;
    if($nbLines%4==0){
        $random = rand(1);
        if($random <= $proba){
            print FASTQOUTF $fqRecord1;
            print FASTQOUTR $fqRecord2;
        }
        $fqRecord1="";
        $fqRecord2="";
    }
    $nbLines++;
}
close(FASTQOUTR);
close(FASTQOUTF);
close(FASTQR);
close(FASTQF);

It takes the two FastQ files as input, the names of output Files, and a probability of taking each read. The probability may be computed as: number of desired reads / total number of reads.

If you want to apply this script on gzipped files:

./sampleFastq.pl <(gunzip -c f.fastq.gz) <(gunzip -c r.fastq.gz) >(gzip -c - > f_sample.fastq.gz) >(gzip -c - > r_sample.fastq.gz) 0.5

ADD COMMENT • link 9.7 years ago by Fred ▴ 780

1

Entering edit mode

Yep nice idea, for 60M reads and 1M sample it would give +/-1000 reads deviation.

ADD REPLY • link 9.7 years ago by mikhail.shugay 3.5k

1

Entering edit mode

7.8 years ago

Matt Shirley 10k

If you want a uniform sample over the entire file, and if all the reads are the same length in both files you can use a tool I've developed that is about as efficient as possible: mdshw5/strandex. It takes the file size, determines offsets within the file to start from, and matches FASTQ entries occurring after those offsets. This way you're reading only slightly more data that you sample, and an exact number of reads can be specified. Memory usage is <1MB and sampling time scales linearly with the number of reads sampled.

ADD COMMENT • link 7.8 years ago by Matt Shirley 10k

1

Entering edit mode

5.8 years ago

vmicrobio ▴ 290

you may try something in awk :

cat file.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) {printf("\n");} else { printf("\t");} }' | 
awk -v k=8000 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | 
awk -F"\t" '{print $1"\n"$2"\n"$3"\n"$4 > "downsamp-file.fastq"}'

ADD COMMENT • link 5.8 years ago by vmicrobio ▴ 290

0

Entering edit mode

9.7 years ago

SES 8.6k

If you specify the fraction of reads to sample with seqtk instead of the number, then only one read will be kept in memory at a time. For example,

seqtk sample -s100 read1.fq 0.1 > sub1.fq

That is not clear from the documentation, but the author does indicate this in another biostar thread ( A: Selecting Random Pairs From Fastq? ).

ADD COMMENT • link 9.7 years ago by SES 8.6k

0

Entering edit mode

I am down sampling data based on lowest number of reads among paired fastq files. I don't think I can use fraction in this case.

ADD REPLY • link 9.7 years ago by mike ▴ 90

0

Entering edit mode

I'm not sure I understand what you mean. In your question you stated that you have 60 million paired sequences in fastq, and if you have a method for determining the lowest number you need, then just convert that to a fraction.

ADD REPLY • link 9.7 years ago by SES 8.6k

2

Entering edit mode

Just added a 2-pass mode to seqtk to trade speed for smaller peak memory.

ADD REPLY • link 9.7 years ago by lh3 33k

1

Entering edit mode

Yes, for 60 million, fraction is preferred. 60 million 100bp reads would require at least 60M*100*2=12GB memory, plus the memory taken by the read names. There are ways to significantly reduce the memory with two-pass file reading.

ADD REPLY • link 9.7 years ago by lh3 33k

0

Entering edit mode

9.7 years ago

mikhail.shugay 3.5k

My 5 cents.

First usually FASTQ is assumed unsorted so you can take just the first X reads.

Then the general solution (Java) is to create an array of 1:60 000 000 and then use Collections.shuffle, take first X numbers and check on each 4 lines of FASTQ check if those are in those first X shuffled numbers. Here is the example using Groovy, it is not using more than 4GB RAM for it and takes several seconds on a laptop to generate the index (lineNumbersFilter). Main limiting factor is I/O speed as hash search is quite fast.

int nLines = 60000000, sampleSize = 10000000

def fastq1 = "R1.fastq", fastq2 = "R2.fastq", out1 = "R1_ds.fastq", out2 = "R2_ds.fastq"

def lineNumbers = new ArrayList<Integer>(0..<nLines)

Collections.shuffle(lineNumbers)

def lineNumbersFilter = new HashSet<Integer>(lineNumbers[0..<sampleSize])

int n = 0

new File(out1).withPrintWriter { pw1 ->

    new File(out2).withPrintWriter { pw2 ->

        new File(fastq1).withReader { reader1 ->

            new File(fastq2).withReader { reader2 ->

                def read1, read2

                while (((read1 = reader1.readLine()) != null) &&
                        ((read2 = reader2.readLine()) != null)) {

                    read1 += "\n" + reader1.readLine() +
                            "\n" + reader1.readLine() +
                            "\n" + reader1.readLine()

                    read2 += "\n" + reader2.readLine() +
                            "\n " + reader2.readLine() +
                            "\n" + reader2.readLine()

                    if (lineNumbersFilter.contains(n)) {
                        pw1.println(read1)
                        pw2.println(read2)
                    }
                }

                n++
            }
        }
    }
}

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by mikhail.shugay 3.5k

0

Entering edit mode

I *think* reads in a fastq from Illumina sequencers are in the order as they appear on the lane of the flow cell. So they are randomized in terms of genomic position but technically they might be biased, which is the reason why the first reads are usually bad...

ADD REPLY • link 9.7 years ago by dariober 14k

0

Entering edit mode

Sure it won't protect from flow cell artifacts, however it usually works fine and only first several reads are of very bad quality.

ADD REPLY • link 9.7 years ago by mikhail.shugay 3.5k

0

Entering edit mode

2.1 years ago

chunshi • 0

description="\ Example: $0 [number of reads | percentage of reads] e.g. $0 1,000,000 # sampling 1 million reads $0 0.9 # sampling 90% of reads "

if [ "$#" -eq 0 ]; then clear cat <<< "$description" exit 0 fi

seqtk=PATH/TO/seqtk

N=$1 N=$(sed 's/,//g' <<< $N) out_dir="sampling_$N"

mkdir -p $out_dir

for fq in *.gz do echo "processing $fq" q "$seqtk sample -s100 $fq $N | gzip> $out_dir/$fq" sleep 1 done

echo "The output sampling fastq files: ./$out_dir"

ADD COMMENT • link 2.1 years ago by chunshi • 0

Kevin Blighe · Accepted Answer · 2014-08-22

The seqtk downsampling.

1. Downsample a fraction of reads (useful when you know the total number):

seqtk sample read1.fa.gz 0.2 > sub1.fa
seqtk sample read2.fa.gz 0.2 > sub2.fa

It takes little memory.

2. Downsample a fixed number (streaming mode):

seqtk sample read1.fa 20000 > sub1.fa
seqtk sample read2.fa 20000 > sub2.fa

It uses the reservoir sampling on a stream. The peak RAM equals the size of output, independent of the input.

3. Downsample a fixed number (2-pass mode):

seqtk sample -2 read1.fa.gz 20000 > sub1.fa
seqtk sample -2 read2.fa.gz 20000 > sub2.fa

It reads the input twice (so twice as slow). In the first pass, it finds the sampled read indices. In the second pass, it outputs reads at the stored indices. The peak RAM is about the number of sampled reads multiplied by 24, again independent of the input. You need the latest seqtk for this 2-pass mode.