Selecting Random Pairs From Fastq?
11
30
Entering edit mode
11.3 years ago
Ketil 4.1k

I'd like to select a random subset of Illumina PE reads from a set of files. I realize this is pretty trivial to do, but I thought I'd ask if there's a simple program available to do so? It's important that it retains both ends of each pair, and in the right order, and that it can work by streaming over the input sequences (obviously they don't fit in RAM).

Or do I just have to write it myself?

random fastq sequence illumina code • 67k views
3
Entering edit mode

Not sure I follow - are you saying a random sampling might introduce bias that wasn't there already? This seems to go against the definition of 'random'.

0
Entering edit mode

Do you expect your coverage to be even? Otherwise you might get into hot water by introducing biases when randomly sampling over the whole genome/region: Downsampling Bam Files

59
Entering edit mode
10.2 years ago
lh3 33k

An old question. With seqtk, to sample a fixed number of reads:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq


Remember to use the same random seed -s. It uses reservoir sampling, the same as R::yield(), which requires keeping 10000 records in memory in the above example. For sampling a large number of sequences, using a fraction is preferred, as that mode keeps only 1 record in memory. Here is an example:

seqtk sample -s100 read1.fq 0.1 > sub1.fq
seqtk sample -s100 read2.fq 0.1 > sub2.fq
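For readers curious about the mechanics of the two modes, here is a minimal Python sketch of both strategies (an illustration of the algorithms, not seqtk's actual C code): reservoir sampling keeps k records in memory, while fraction sampling keeps only the current record.

```python
import random

def reservoir_sample(records, k, seed=100):
    """Uniformly sample k items from a stream, holding at most
    k items in memory at any time."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # keep the new record with probability k/(i+1),
            # evicting a randomly chosen stored record
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir

def fraction_sample(records, frac, seed=100):
    """Keep each record independently with probability frac;
    only one record is in memory at a time."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < frac]
```

Because the selection depends only on the seed and each record's position in the stream, running either function with the same seed over read1 and read2 picks the same positions, which is why paired files stay synchronized.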

1
Entering edit mode

Easy to use and effective. Thanks a lot.

1
Entering edit mode

Is seqtk sample sampling with replacement or without replacement?

0
Entering edit mode

I think it is safe to assume that the sampling is without replacement.

This is because the name of the sequence is not changed; moreover, most aligners require that sequences belonging to different templates be named differently.

0
Entering edit mode

I couldn't find much information about the seqtk sample function, and I was wondering if anyone knows what happens when you choose a seed value which is too large for your original number of reads, considering the target size you set for your subsample.

E.g. fastq file contains 15,000,000 reads, I want to sub set a group of 10,000,000 reads from those, using a seed value of 1,000 (because I have too many other files to process together, and some are way bigger, so I thought I could keep the seed value). Do you think it would start reading again from the beginning of the file when it reaches the end, causing duplication of reads?

I tested this and seqtk generated the 10,000,000-read subset. I FastQC'd the subset and the original files and there weren't strong differences in duplication (even though it would have reached the end of the original file after only 15,000 reads).

0
Entering edit mode

I am having an issue using seqtk to subsample some illumina reads prior to assembly with MIRA. I got the following message after I tried to subsample 1,000,000 reads from an illumina fastq paired end file:

MIRA found duplicate read names in your data (see log above for more info).
*                                                                         *
*This should never, never be!


When I subsample with seqtk is there a chance that it will choose the same sequence twice in a subsampling? Just FYI, I am subsampling 1,000,000 reads from an 8,000,000 read paired end file set.

0
Entering edit mode

It sounds like you're passing in a file that has interleaved reads. Is that true? In that case, you might randomly get a forward and reverse pair right next to each other.

If so, you need to use seqtk on both the forward and reverse files independently, then interleave them.
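For that last step, one way to interleave the two independently sampled files is the classic awk/paste idiom (a sketch, assuming both files contain the same reads in the same order, four lines per record; the two `printf` lines just create toy inputs standing in for the seqtk output):

```shell
# toy paired files standing in for seqtk output (same order in both)
printf '@r1/1\nAAAA\n+\nIIII\n@r2/1\nCCCC\n+\nIIII\n' > sub1.fq
printf '@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n' > sub2.fq

# fold each 4-line FASTQ record onto one tab-separated line
awk 'ORS=NR%4?"\t":"\n"' sub1.fq > r1.tmp
awk 'ORS=NR%4?"\t":"\n"' sub2.fq > r2.tmp

# alternate records from the two files, then unfold back to 4 lines each
paste -d '\n' r1.tmp r2.tmp | tr '\t' '\n' > interleaved.fq
rm r1.tmp r2.tmp
```

The result is record 1 of file 1, then record 1 of file 2, then record 2 of file 1, and so on.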

0
Entering edit mode

If I want to subsample 10M, 20M and 50M reads from a 70M-read file of fwd and rev, should I use different seed values for the different read depths? I want to have the same read pairs in fwd and rev, but random read pairs at the different read depths. Also, does it accept gzipped files?

0
Entering edit mode

I'm hoping you've solved your problem in the last four months, but for future reference, I was wondering the same thing and YES you do have to change the seed or else it will generate the same 'random' sample.

0
Entering edit mode

Are the subsampled reads the same in file1 and file2? If not, is there a way to take, from the reads of the first subsampled file, the paired reads of the second one?

1
Entering edit mode

Yes, as long as you use the same -s.

0
Entering edit mode

What is the function of -s?

1
Entering edit mode

That's random seed. Same seed ensures same "random" results (useful for reproducibility).

0
Entering edit mode

Thank you. But what does the number after -s in command line represent?

0
Entering edit mode

As noted above, that is the random seed: something used in the initialization step of any simulation. In R, for example, it is set.seed(), called before performing a simulation.

0
Entering edit mode

Here -s is set to 100; from what range is the random selection then drawn? Does it make a difference if the number is larger or smaller?
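To make the seed's role concrete, here is a small Python illustration (Python's random module standing in as a stand-in for seqtk's generator, purely for demonstration): the seed is not drawn from any range related to your data; it only fixes which records get picked.

```python
import random

def pick_indices(n_records, k, seed):
    """Choose k of n_records record indices without replacement;
    the seed determines which ones are chosen."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), k))

# The same seed always selects the same records, regardless of
# whether the seed itself is "large" or "small".
same_a = pick_indices(15000, 100, seed=100)
same_b = pick_indices(15000, 100, seed=100)
other = pick_indices(15000, 100, seed=4)
```

Here `same_a` and `same_b` are always identical, while `other` is a different but equally random subset.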

0
Entering edit mode

Hi! I have two fastq files, each with 2522553 reads in them. I want to do a sub sampling to have only 46485 reads in each fastq file, so I was thinking of using seqtk with the following commands for each fastq file:

seqtk sample -s100 cp_trim_1P.fq 46485 > cp1.100cov.fq

seqtk sample -s100 cp_trim_2P.fq 46485 > cp2.100cov.fq


My doubt is whether the random seed used (-s 100) is correct for the number of reads I want to subsample.

0
Entering edit mode

the seed shouldn't influence how many reads are returned; it influences which reads are selected rather than how many are selected (provided you've stated how many reads you want)

0
Entering edit mode

Thank you for your answer. I thought that the seed number had some influence on the number of reads. So using 100 as the seed number (or 4, like my teacher did in an example) should not matter, as long as I use the same seed number for both fastq files?

0
Entering edit mode

Yes, that is correct. The seed only affects how the reads are selected.

0
Entering edit mode

Thank you ningyuan.sg :)

0
Entering edit mode

Is there a way to check if the subsampling has worked? For example, if I run seqtk sample -s100 read1.fq 10000 > sub1.fq, how can I check that the reads have been subsampled to the quantity specified (10000)?

0
Entering edit mode

If all you need is the number of sequences, count the number of lines and divide by 4:

echo $(cat reads.fq | wc -l)/4 | bc

To actually verify that the sampling was done as claimed, well, that would be a much harder task.

24
Entering edit mode
11.3 years ago

In one command line:

paste f1.fastq f2.fastq |\ #merge the two fastqs
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |\ #merge by group of 4 lines
shuf |\ #shuffle
sed 's/\t\t/\n/g' |\ #restore the delimiters
awk '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}' #split in two files

1
Entering edit mode

never heard of shuf before - turns out I have it (Fedora) - will the wonders ever cease

1
Entering edit mode

Hi, this script seems to be a good solution. However, I had a small problem: the fastq headers of the second file are partially deleted. Any solution? Regards, P

1
Entering edit mode

I believe that I was able to solve your problem, paulorapazote, by making the \t field delimiter explicit for awk in the last line (with the -F option), i.e.:

awk -F '\t' '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}' #split in two files.

0
Entering edit mode

Is "shuf" standard?

0
Entering edit mode

I found it on all the linux systems I used.

0
Entering edit mode

Huh. It's not on any of mine, but I do see it's a GNU util. Thanks...learned something new!

0
Entering edit mode

Likewise. I wrote my own years ago b/c I understood there not to be such a GNU util.

0
Entering edit mode

Hi Pierre, I am curious, how do you know the right pairs are being paired? is that done by the merge function? thanks.

0
Entering edit mode

What are the "right" pairs? In my script, the two files are temporarily merged into one file.

0
Entering edit mode

Well, then the pairs are in the same order in each file. That makes sense, thanks!

16
Entering edit mode
11.3 years ago
Martin Morgan ★ 1.6k

In Bioconductor, the ShortRead package has

library(ShortRead)
f1 <- FastqSampler("file1.fastq", n=1e6)
f2 <- FastqSampler("file2.fastq", n=1e6)

set.seed(123L); p1 <- yield(f1)
set.seed(123L); p2 <- yield(f2)


Assuming reads are ordered identically in each file. FastqSampler subsets the file in a single pass, keeping the ith record if i <= n or a random (0, 1] deviate is less than n / i; in the latter case the ith read replaces a randomly chosen read in the n currently saved (actually, the implementation does this on chunks of reads, in line with how R works best).
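The single-pass scheme described above can be sketched in a few lines of Python (an illustration of the algorithm as described, not ShortRead's implementation, which works on chunks in C):

```python
import random

def single_pass_sample(records, n, seed=123):
    """Keep record i outright while fewer than n are stored;
    afterwards keep it with probability n/i, evicting a randomly
    chosen stored record. One pass, at most n records in memory."""
    rng = random.Random(seed)
    kept = []
    for i, rec in enumerate(records, start=1):
        if i <= n:
            kept.append(rec)
        elif rng.random() < n / i:
            kept[rng.randrange(n)] = rec
    return kept
```

Seeding the generator identically before sampling each file, as the set.seed(123L) calls do above, makes both passes select the same record positions, so the pairing survives.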

8
Entering edit mode
11.3 years ago
brentp 24k

I had some code to do this for single end here: https://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

You can easily make that work for paired end by iterating over fq1 and fq2 in parallel. Actually, I edited it so it will do as you suggest, and pasted it below.

You call it like:

$ python get_subset.py /path/to/reads_1.fastq /path/to/reads_2.fastq 100

and it will create new files with the same name as your paired files, but ending in .subset and containing 100 random records.

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """get N random headers from a fastq file without
    reading the whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa), open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = -1
    for rr in rand_records:
        while rec_no < rr:
            rec_no += 1
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
            suba.write(fha.readline())
            subb.write(fhb.readline())
        rec_no += 1  # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" % (suba.name, subb.name)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)

2
Entering edit mode

I think

rand_records = sorted(random.sample(xrange(records), int(N)))

would do the job.

0
Entering edit mode

This bit is slick. Very nice.

records = sum(1 for _ in open(fqa)) / 4
rand_records = sorted([random.randint(0, records) for _ in xrange(N)])

0
Entering edit mode

thanks. it works nicely if you're not worried about leaving open file-handles (which i am not).

0
Entering edit mode

Thanks for the script. I'm pretty sure the following lines should be altered, otherwise it returns the wrong number of reads:

1. rand_records = sorted([random.randint(1, records) for _ in xrange(N)])
2. rec_no = 0
3. for i in range(4): rec_no += 1 suba.write(fha.readline()) subb.write(fhb.read

0
Entering edit mode

@aunderwo, I changed it to do random.randint(0, records - 1), I think everything else works as it should.

0
Entering edit mode

@brentp, you didn't comment on 2) from @aunderwo. I also agree that rec_no should be initialized to 0 instead of -1. Right?

0
Entering edit mode

@brentp Thanks for the update. I still think you need the rec_no += 1 after the for i in range(4): otherwise over time it returns fewer records than specified. I have tested this with a range of N.

0
Entering edit mode

@anderwo Thanks. You are correct. Updated. BTW, I have been using this, which does everything correctly (now that I just adjusted the upper bound to randint). It even checks to make sure it writes the correct number of records.

0
Entering edit mode

@brentp: thanks for this. One potential issue when generating the list of record indices is that the same index can appear multiple times. Something like this would enforce uniqueness:

rand_records = []
while len(rand_records) < N:
    r = random.randint(0, records - 1)
    if r not in rand_records:
        rand_records.append(r)
rand_records = sorted(rand_records)

but it isn't very elegant :(

0
Entering edit mode

What about:

rand_records = random.sample(range(0, records), N)
rand_records.sort()

random.sample performs random selection without replacement, so the check you've added above would just be built in to the call.

7
Entering edit mode
11.3 years ago

In the past, I've used the following paste/awk trick to combine the two files into a single line, assign a random number to each pair, sort by the random number, then choose the first N pairs you wish. One command line, but broken up with comments for clarity.

# Starting FASTQ files
export FQ1=1.fq
export FQ2=2.fq

# The names of the random subsets you wish to create
export FQ1SUBSET=1.rand.fq
export FQ2SUBSET=2.rand.fq

# How many random pairs do we want?
export N=100

# paste the two FASTQ files such that the
# header, seqs, seps, and quals occur "next" to one another
paste $FQ1 $FQ2 | \
# "linearize" the two mates into a single record.
# Add a random number to the front of each line.
awk 'BEGIN{srand()}; {OFS="\t"; \
     getline seqs; getline sep; getline quals; \
     print rand(),$0,seqs,sep,quals}' | \
# sort by the random number
sort -k1,1 | \
# grab the first N records
head -n $N | \
# Convert the stream back to 2 separate FASTQ files.
awk '{OFS="\n"; \
     print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
     print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'

0
Entering edit mode

Whoops, I've just realized that my answer was very close to yours.

0
Entering edit mode

Thanks Aaron!!!

4
Entering edit mode
11.3 years ago
Ketil 4.1k

Thanks for all the answers. Being bored, I also wrote my own program for this (http://malde.org/~ketil/biohaskell/biolib/examples/RSelectPE.hs), but the examples above illustrate some nice techniques for achieving this with other tools, so I hope you'll forgive me for also attacking the problem myself.

Most solutions shuffle the files and take the heads; I expect that brentp's solution of only sorting the chosen indices (if I understand correctly) will be more efficient, even if Python is likely to have a slower sort/random than the Unix tools (shuf and sort). I also like Martin Morgan's trick of running a random sampling routine twice with the same initial seed.

In my own implementation, I also select each read based on a random value, but I work on pairs of files simultaneously, which requires more plumbing.

Again, thanks to everybody who contributed!

0
Entering edit mode

+1 for the open source program

4
Entering edit mode
7.9 years ago

You could use sample:

$ sample -k 1000 -l 4 huge_file.fastq > 1k_sample.fq


This uses reservoir sampling on newline offsets to reduce memory overhead. If your system is thrashing for lack of system memory, this will help a great deal by reducing the memory overhead required for sampling as compared with other tools.

That includes GNU sort, which will often run out of memory on genome-scale inputs because, while it offers reservoir sampling, it stores the entire line in memory for every line of input. For genome-scale work, this is often simply too much data.

Specify -k N for the number N of samples, and -l L for number of lines per record-to-sample (L). In the case of fastq, that would be four lines per record, so -l 4.

Add -d S to specify a seed value S, if desired, or it is drawn from a Mersenne Twister PRNG.

Generating your own random seed and applying that same seed when sampling two fastq files will let you grab the same samples from two files of paired reads:

$ sample -k 1000 -l 4 -s 1234 first_pair.fastq > 1k_first_pair_sample.fq
$ sample -k 1000 -l 4 -s 1234 second_pair.fastq > 1k_second_pair_sample.fq


If you have one interleaved file, then you could specify -l 8 to sample a pair of records from every eight lines:

\$ sample -k 1000 -l 8 interleaved.fastq > 1k_paired_sample.fq


You can also add -r to sample with replacement; the default is to sample without replacement. Or add -p to print a sample in the order provided by the original input.

0
Entering edit mode

Thank you for your response, it is helpful for me as it doesn't require a large amount of memory.

1
Entering edit mode
10.2 years ago
chris.mit7 ▴ 60

Bit old, but here's my solution

import random

f = open('CD4CD45RO_GATCAG_s_1_1_sequence.txt')
f2 = open('CD4CD45RO_GATCAG_s_1_2_sequence.txt')
o = open(f.name + '.sample', 'w')
o2 = open(f2.name + '.sample', 'w')
sampleSize = 1000000
tSize = 66000000
f2pos = []
p = float(sampleSize) / float(tSize)
row = f.readline()
while row:
    if row[0] == '@':
        if random.random() < p:
            # remember where this record starts; assumes both files
            # have identical record byte offsets
            f2pos.append(f.tell() - len(row))
            o.write(row)
            for i in range(3):
                o.write(f.readline())
        else:
            for i in range(3):
                f.readline()
    row = f.readline()
for pos in f2pos:
    f2.seek(pos)
    for i in range(4):
        o2.write(f2.readline())

I'm not sure how long the others take, this one is done in like ~10 minutes on 66 million reads.

0
Entering edit mode
11.1 years ago
Paquola • 0

I just saw in the man page that shuf has this option:

   -n, --head-count=COUNT
output at most COUNT lines


This is potentially much more efficient than permuting the whole file and then | head.

But never mind... I tested it, and shuf puts the whole thing in memory. It's not really different from | head.

0
Entering edit mode
7.9 years ago
mondiraray • 0

(I used the HTSeq tool, as recommended by Simon.)

I also ran out of memory when I tried Aaron's, Pierre's, and Martin's recommendations. Hope this helps anyone else having memory issues. Thanks.

0
Entering edit mode
6.2 years ago

Dunno if it's still relevant, but I remember having written a tool for this that used one-pass rejection sampling; see https://github.com/plijnzaad/phtools/blob/master/ngs/fastq-subsample-paired.pl

0
Entering edit mode

Isn't there a neater way of doing this nowadays, with some command from samtools or a similar tool, maybe? Thanks anyway for sharing your script.