Question: Selecting Random Pairs From Fastq?
gravatar for Ketil
7.3 years ago by
Ketil3.9k wrote:

I'd like to select a random subset of Illumina PE reads from a set of files. I realize this is pretty trivial to do, but I thought I'd ask if there's a simple program available to do so? It's important that it retains both ends of each pair, and in the right order, and that it can work by streaming over the input sequences (obviously they don't fit in RAM).

Or do I just have to write it myself?

ADD COMMENTlink modified 2.3 years ago by plijnzaad30 • written 7.3 years ago by Ketil3.9k

Not sure I follow - are you saying a random sampling might introduce bias that wasn't there already? This seems to go against the definition of 'random'.

ADD REPLYlink written 7.3 years ago by Ketil3.9k

Do you expect your coverage to be even, otherwise you might get in hot water by introducing biases when randomly sampling over the whole genome/region:

ADD REPLYlink written 7.3 years ago by Allpowerde1.2k
gravatar for lh3
6.3 years ago by
United States
lh331k wrote:

An old question. With seqtk, to sample a fixed number of reads:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

Remember to use the same random seed "-s". It uses reservoir sampling, the same as R::yield(), which requires to keep 10000 records in memory in the above example. For sampling a large number of sequences, using a fraction is preferred as this method only keeps 1 record in memory. Here is an example:

seqtk sample -s100 read1.fq 0.1 > sub1.fq
seqtk sample -s100 read2.fq 0.1 > sub2.fq
ADD COMMENTlink written 6.3 years ago by lh331k

Easy to use and effective. Thanks a lot.

ADD REPLYlink written 5.3 years ago by Biomonika (Noolean)3.0k

is seqtk sample using sampling by replacement or without replacement?

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by epigene400

I think it is safe to assume that the sampling is without replacement.

This is because the name of the sequence is not changed moreover most aligners require that sequences that belong to different templates be named differently

ADD REPLYlink written 4.3 years ago by Istvan Albert ♦♦ 77k

I couldn't find much information about the seqtk sample function, and I was wondering if anyone knows what happens when you choose a seed value which is too large for your original number of reads, considering the target size you set for your subsample.

E.g. fastq file contains 15,000,000 reads, I want to sub set a group of 10,000,000 reads from those, using a seed value of 1,000 (because I have too many other files to process together, and some are way bigger, so I thought I could keep the seed value). Do you think it would start reading again from the beginning of the file when it reaches the end, causing duplication of reads?

I tested this and setqk generated the 10,000,000 reads subset, I fastqc'd the subset and the original files and there wasn't strong differences in duplication (even though it would have reached the end of the original file after only 15,000 reads).

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by rodrigo.duarte8810

I am having an issue using seqtk to subsample some illumina reads prior to assembly with MIRA.  I got the following message after I tried to subsample 1,000,000 reads from an illumina fastq paired end file:

"MIRA found duplicate read names in your data (see log above for more info).  *
*                                                                              *
* This should never, never be!"


When I subsample with seqtk is there a chance that it will choose the same sequence twice in a subsampling?  Just FYI, I am subsampling 1,000,000 reads from an 8,000,000 read paired end file set.

ADD REPLYlink written 3.1 years ago by julestrachsel20

It sounds like you're passing in a file that has interleaved reads. Is that true? In that case, you might get randomly get a forward and reverse pair right next to each other.

If so, you need to use seqtk on both the forward and reverse files independently, then interleave them.

ADD REPLYlink written 2.0 years ago by conchoecia0

If I want to subsample 10m, 20m and 50m reads from a 70m reads file of fwd and rev, should I use different seed values for different read depts? I want to have same read pairs in fwd and rev, but random read pairs in different read depths. Also does it accept gzipped files?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by rmf280

I'm hoping you've solved your problem in the last four months, but for future reference, I was wondering the same thing and YES you do have to change the seed or else it will generate the same 'random' sample.

ADD REPLYlink written 21 months ago by datascientist28350

are the sub sampled reads the same in file1 and file2? if not, is there a way to take from the first sub sampled file reads, the paired read of the second one?

ADD REPLYlink written 13 months ago by fusion.slope150

Yes, as long as you use the same -s.

ADD REPLYlink written 13 months ago by lh331k

What is the function of -s??

ADD REPLYlink written 10 months ago by KVC_bioinfo310

That's random seed. Same seed ensures same "random" results (useful for reproducibility).

ADD REPLYlink written 10 months ago by Biomonika (Noolean)3.0k

Thank you. But what does the number after -s in command line represent?

ADD REPLYlink written 10 months ago by KVC_bioinfo310

he told you that is the random seed. Something that is used in any initial simulation step. In R is set.seed for example before perform simulation..

ADD REPLYlink written 10 months ago by fusion.slope150

here set -s is 100, then the random was selected from what rang? If the number larger or smaller , there was difference?

ADD REPLYlink written 5 days ago by suyanan240
gravatar for Pierre Lindenbaum
7.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum109k wrote:

In one command line:

paste f1.fastq f2.fastq |\ #merge the two fastqs
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |\ #merge by group of 4 lines
shuf  |\ #shuffle
head |\ #only 10 records
sed 's/\t\t/\n/g' |\ #restore the delimiters
awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}' #split in two files.
ADD COMMENTlink written 7.3 years ago by Pierre Lindenbaum109k

never heard of shuf before - turns out I have it (Fedora) - will the wonders ever cease

ADD REPLYlink written 7.3 years ago by Istvan Albert ♦♦ 77k

Hi This script seems to be a good solution. However I had a small problem - The fastq headers of the second file are deleted partially. Any solution ? Regards, P

ADD REPLYlink written 5.1 years ago by paulorapazote40

I believe that I was able to solve your problem, paulorapazote, by making the "\t" field delimiter explicit for awk in the last line (with the -F option), i.e.: awk -F '\t' '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}' #split in two files.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by russianconcussion20

If "shuf" standard?

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

I found it on all the linux systems I used.

ADD REPLYlink written 7.3 years ago by Pierre Lindenbaum109k

Huh. It's not on any of mine, but I do see it's a GNU util. Thanks...learned something new!

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

Likewise. I wrote my own years ago b/c I understood there not to be such a GNU util.

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

Hi Pierre, I am curious, how do you know the right pairs are being paired? is that done by the merge function? thanks.

ADD REPLYlink written 5.3 years ago by dfernan630

what is the "right" pairs ? in my script , two files are temporarily merged into one file.

ADD REPLYlink written 5.3 years ago by Pierre Lindenbaum109k

well, then the pairs are in the same order in each file, but that makes sense, thanks!

ADD REPLYlink written 5.3 years ago by dfernan630
gravatar for Martin Morgan
7.3 years ago by
Martin Morgan1.5k
United States
Martin Morgan1.5k wrote:

In Bioconductor, the ShortRead package has

f1 <- FastqSampler("file1.fastq", n=1e6)
f2 <- FastqSampler("file2.fastq", n=1e6)

set.seed(123L); p1 <- yield(f1)
set.seed(123L); p2 <- yield(f2)

assuming reads are ordered identically in each file. FastqSampler subsets the file in a single pass, keeping the ith record if i <= n or a random (0, 1] deviate is less than n / i; in the latter case the ith read replaces a randomly chosen read in the n currently saved (actually, the implementation does this on chunks of reads, in line with how R works best).

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Martin Morgan1.5k
gravatar for brentp
7.3 years ago by
Salt Lake City, UT
brentp22k wrote:

I had some code to do this for single end here:

you can easily make that work for paired end by iterating over fq1 and fq2 in parallel. Actually, I edited it so it will do as you suggest and paste below.

you call it like:

$ python /path/to/reads_1.fastq /path/to/reads_2.fastq 100

and it will create new files with the same name as your paired files, but ending in .subset and containing 100 random records.

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa),  open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = - 1
    for rr in rand_records:

        while rec_no < rr:
            rec_no += 1       
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
        rec_no += 1 # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" %,

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)
ADD COMMENTlink modified 5.8 years ago by David Quigley11k • written 7.3 years ago by brentp22k

I think rand_records = sorted(random.sample(xrange(records), int(N))) Would do the job.

ADD REPLYlink written 6.4 years ago by Mitchell40

This bit is slick. Very nice.

records = sum(1 for _ in open(fqa)) / 4 rand_records = sorted([random.randint(0, records) for _ in xrange(N)])

ADD REPLYlink written 7.3 years ago by Aaronquinlan10k

thanks. it works nicely if you're not worried about leaving open file-handles (which i am not).

ADD REPLYlink written 7.3 years ago by brentp22k

Thanks for the script. I'm pretty sure the following lines should be altered Otherwise it returns the wrong number of reads

1) rand_records = sorted([random.randint(1, records) for _ in xrange(N)]) 2) rec_no = 0 3)for i in range(4): rec_no += 1
suba.write(fha.readline()) subb.write(

ADD REPLYlink written 7.2 years ago by aunderwo0

@aunderwo, I changed it to do random.randint(0, records - 1), I think everything else works as it should.

ADD REPLYlink written 7.2 years ago by brentp22k

@brentp, you didn't comment on 2) from @aunderwo. I also agree that rec_no should be initialized 0 instead of -1. Right?

ADD REPLYlink written 4.8 years ago by Carlos Borroto1.7k

@brentp Thanks for the update. I still think you need the rec_no += 1 after the for i in range(4): otherwise over time it returns less records then specified. I have tested this with a range of N

ADD REPLYlink written 7.2 years ago by aunderwo0

@anderwo Thanks. You are correct. Updated. BTW, I have been using this: which does everything correctly (now that I just adjusted the upper bound to randint). It even checks to make sure it writes the correct number of records.

ADD REPLYlink written 7.2 years ago by brentp22k

@brentp: thanks for this. One potential issue when generating the list of record indices is that the same index can appear multiple times. Something like this would enforce uniqueness:

rand_records = [] while len(rand_records) < N: r = random.randint(0,records-1) if r not in rand_records: rand_records.append(r) rand_records = sorted(rand_records)

but isn't very elegant :(

ADD REPLYlink written 6.6 years ago by Peter Briggs20
gravatar for Aaronquinlan
7.3 years ago by
United States
Aaronquinlan10k wrote:

In the past, I've used the following paste/awk trick to combine the two files into a single line, assign a random number to each pair, sort by the random number, then choose the first N pairs you wish.

One command line, but broken up with comments for clarity.

# Starting FASTQ files
export FQ1=1.fq
export FQ2=2.fq

# The names of the random subsets you wish to create
export FQ1SUBSET=1.rand.fq
export FQ2SUBSET=2.rand.fq

# How many random pairs do we want?
export N=100

# paste the two FASTQ such that the 
# header, seqs, seps, and quals occur "next" to one another
  paste $FQ1 $FQ2 | \
# "linearize" the two mates into a single record.  Add a random number to the front of each line
  awk 'BEGIN{srand()}; {OFS="\t"; \
                        getline seqs; getline sep; getline quals; \
                        print rand(),$0,seqs,sep,quals}' | \
# sort by the random number
  sort -k1,1 | \
# grab the first N records
  head -n $N | \
# Convert the stream back to 2 separate FASTQ files.
  awk '{OFS="\n"; \
        print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
        print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'
ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Aaronquinlan10k

Hoopss, I've just realized that my answer was very close to yours

ADD REPLYlink written 7.3 years ago by Pierre Lindenbaum109k

Thanks Aaron!!!

ADD REPLYlink written 5.0 years ago by Michele Busby1.8k
gravatar for Ketil
7.3 years ago by
Ketil3.9k wrote:

Thanks for all the answers. Being bored, I also wrote my own program for this (, but the examples above illustrates some nice techniques to achieve this using other tools, so I hope you'll forgive me for also attacking the problem myself.

Most solutions shuffle the files and takes the heads, I expect that brentp's solution of only sorting the chosen indices (if I understand correctly) will be more efficient, even if Python is likely to have a slower sort/random than Unix tools (shuf and sort).

I also like the trick by Martin Morgan of running a random sampling routine twice with the same initial seed. In my own implementation, I also select each read based on a random value, but I work on pairs of files simultaneously, which requires more plumbing.

Again, thanks to everybody who contributed!

ADD COMMENTlink written 7.3 years ago by Ketil3.9k

+1 for the open source program

ADD REPLYlink written 7.3 years ago by Pierre Lindenbaum109k
gravatar for chris.mit7
6.3 years ago by
United States
chris.mit760 wrote:

Bit old, but here's my solution

import random
f = open('CD4CD45RO_GATCAG_s_1_1_sequence.txt')
f2 = open('CD4CD45RO_GATCAG_s_1_2_sequence.txt')
o ='.sample', 'wb')
o2 ='.sample', 'wb')
row = f.readline()
sampleSize = 1000000
tSize = 66000000
f2pos = []
p = float(sampleSize)/float(tSize)
while row:
    if row[0] == '@':
        if random.random()<p:
            for i in xrange(3):
    row = f.readline()
for pos in f2pos:
    for i in xrange(4):

I'm not sure how long the others take, this one is done in like ~10 minutes on 66 million reads.

ADD COMMENTlink modified 3.9 years ago • written 6.3 years ago by chris.mit760
gravatar for Alex Reynolds
3.9 years ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

You could use sample:

$ sample -k 1000 -l 4 huge_file.fastq > 1k_sample.fq

This uses reservoir sampling on newline offsets to reduce memory overhead. If your system is thrashing for lack of system memory, this will help a great deal by reducing the memory overhead required for sampling as compared with other tools.

Including GNU sort, which will often run out of memory on genome-scale inputs, because, while it offers reservoir sampling, it stores the entire line in memory, for every line of input. For genome-scale work, this is often simply too much data.

Specify -k N for the number N of samples, and -l L for number of lines per record-to-sample (L). In the case of fastq, that would be four lines per record, so -l 4.

Add -d S to specify a seed value S, if desired, or it is drawn from a Mersenne Twister PRNG.

Generating your own random seed and applying that same seed when sampling two fastq files will let you grab the same samples from two files of paired reads:

$ sample -k 1000 -l 4 -s 1234 first_pair.fastq > 1k_first_pair_sample.fq
$ sample -k 1000 -l 4 -s 1234 second_pair.fastq > 1k_second_pair_sample.fq

If you have one interleaved file, then you could specify -l 8 to sample a pair of records from every eight lines:

$ sample -k 1000 -l 8 interleaved.fastq > 1k_paired_sample.fq

You can also add -r to sample with replacement; the default is to sample without replacement. Or add -p to print a sample in the order provided by the original input.

ADD COMMENTlink modified 10 months ago • written 3.9 years ago by Alex Reynolds24k
gravatar for Paquola
7.2 years ago by
Paquola0 wrote:

I just saw in the man page that shuf has this option:

   -n, --head-count=COUNT
          output at most COUNT lines

This is potentially much more efficient than permuting the whole file and then | head.

But never mind... Tested it, and shuf puts the whole thing in memory. It's not really different than | head.

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Paquola0
gravatar for mondiraray
3.9 years ago by
United States
mondiraray0 wrote:

((Used the HTSeq tool as recommended by Simon))

I also ran out of memory when I tried Aaron, Pierre, and Martin's recommendations. Hope this helps for anyone else also having memory issues. Thanks.


ADD COMMENTlink written 3.9 years ago by mondiraray0
gravatar for plijnzaad
2.3 years ago by
European Union
plijnzaad30 wrote:

Dunno if still relevant, but I remember having written a tool for this that used one pass rejection sampling, see

ADD COMMENTlink written 2.3 years ago by plijnzaad30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1537 users visited in the last hour