Question: Selecting Random Pairs From Fastq?
13 votes • Ketil (Germany) wrote:

I'd like to select a random subset of Illumina PE reads from a set of files. I realize this is pretty trivial to do, but I thought I'd ask if there's a simple program available to do so? It's important that it retains both ends of each pair, and in the right order, and that it can work by streaming over the input sequences (obviously they don't fit in RAM).

Or do I just have to write it myself?

-- modified 14 months ago by plijnzaad • written 6.3 years ago by Ketil

Not sure I follow - are you saying a random sampling might introduce bias that wasn't there already? This seems to go against the definition of 'random'.

-- written 6.3 years ago by Ketil

Do you expect your coverage to be even? Otherwise you might get into hot water by introducing biases when randomly sampling over the whole genome/region: http://biostar.stackexchange.com/questions/4340/downsampling-bam-files

-- written 6.3 years ago by Allpowerde
20 votes • Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

In one command line:

paste f1.fastq f2.fastq |   # merge the two fastqs line by line
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |   # join each group of 4 lines into one record
shuf |   # shuffle the records
head |   # keep only the first 10 records
sed 's/\t\t/\n/g' |   # restore the line delimiters
awk '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}'   # split back into two files
-- written 6.3 years ago by Pierre Lindenbaum

never heard of shuf before - turns out I have it (Fedora) - will the wonders ever cease

-- written 6.3 years ago by Istvan Albert

Hi, this script seems to be a good solution. However, I had a small problem: the FASTQ headers of the second file are partially deleted. Any solution? Regards, P

-- written 4.0 years ago by paulorapazote

I believe that I was able to solve your problem, paulorapazote, by making the "\t" field delimiter explicit for awk in the last line (with the -F option), i.e.: awk -F '\t' '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}' #split back into two files.

-- modified 3.8 years ago • written 3.8 years ago by russianconcussion

If "shuf" standard?

-- written 6.3 years ago by Aaronquinlan

I found it on all the Linux systems I've used.

-- written 6.3 years ago by Pierre Lindenbaum

Huh. It's not on any of mine, but I do see it's a GNU util. Thanks...learned something new!

-- written 6.3 years ago by Aaronquinlan

Likewise. I wrote my own years ago b/c I understood there not to be such a GNU util.

-- written 6.3 years ago by Aaronquinlan

Hi Pierre, I am curious: how do you know the right pairs are being paired? Is that done by the merge function? Thanks.

-- written 4.3 years ago by dfernan

What are the "right" pairs? In my script, the two files are temporarily merged into one file.

-- written 4.3 years ago by Pierre Lindenbaum

Well, then the pairs are in the same order in each file; that makes sense, thanks!

-- written 4.2 years ago by dfernan
20 votes • lh3 (United States) wrote:

An old question. With seqtk, to sample a fixed number of reads:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

Remember to use the same random seed "-s". It uses reservoir sampling, the same as yield() from R's ShortRead package, which requires keeping 10000 records in memory in the above example. For sampling a large number of sequences, using a fraction is preferred, as that mode keeps only 1 record in memory. Here is an example:

seqtk sample -s100 read1.fq 0.1 > sub1.fq
seqtk sample -s100 read2.fq 0.1 > sub2.fq
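
For intuition: the fraction mode makes an independent keep-or-drop decision for each record from a seeded random stream, so running it with the same seed on both files makes identical decisions for both mates and preserves the pairing. A minimal Python sketch of the idea (an illustration of the principle, not seqtk's actual code; the file names are placeholders):

import random

def sample_fraction(fastq_in, fastq_out, fraction, seed=100):
    # Keep each 4-line FASTQ record with probability `fraction`,
    # holding only one record in memory at a time.
    rng = random.Random(seed)
    with open(fastq_in) as fin, open(fastq_out, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                fout.writelines(record)

# The same seed selects the same record indices in both mates:
sample_fraction("read1.fq", "sub1.fq", 0.1, seed=100)
sample_fraction("read2.fq", "sub2.fq", 0.1, seed=100)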
-- written 5.2 years ago by lh3

Easy to use and effective. Thanks a lot.

-- written 4.2 years ago by Biomonika (Noolean)

Is seqtk sample sampling with replacement or without replacement?

-- modified 3.2 years ago • written 3.2 years ago by epigene

I think it is safe to assume that the sampling is without replacement.

This is because the names of the sequences are not changed; moreover, most aligners require that sequences belonging to different templates be named differently.

-- written 3.2 years ago by Istvan Albert

I am having an issue using seqtk to subsample some Illumina reads prior to assembly with MIRA. I got the following message after I tried to subsample 1,000,000 reads from an Illumina paired-end FASTQ file:

"MIRA found duplicate read names in your data (see log above for more info). This should never, never be!"

When I subsample with seqtk, is there a chance that it will choose the same sequence twice in a subsampling? Just FYI, I am subsampling 1,000,000 reads from an 8,000,000-read paired-end file set.

-- written 2.0 years ago by julestrachsel

It sounds like you're passing in a file that has interleaved reads. Is that true? In that case, you might randomly get a forward and reverse pair right next to each other.

If so, you need to use seqtk on both the forward and reverse files independently, then interleave them.

-- written 11 months ago by cypridina

If I want to subsample 10M, 20M and 50M reads from a 70M-read file set of fwd and rev, should I use different seed values for the different read depths? I want the same read pairs in fwd and rev, but random read pairs across the different read depths. Also, does it accept gzipped files?

-- modified 12 months ago • written 12 months ago by rmf

I'm hoping you've solved your problem in the last four months, but for future reference, I was wondering the same thing, and YES, you do have to change the seed, or else it will generate the same 'random' sample.

-- written 8 months ago by datascientist2827

Are the subsampled reads the same in file1 and file2? If not, is there a way to take, for each read in the first subsampled file, the paired read from the second one?

-- written 4 weeks ago by fusion.slope

Yes, as long as you use the same -s.

-- written 4 weeks ago by lh3
16 votes • Martin Morgan (United States) wrote:

In Bioconductor, the ShortRead package has

library(ShortRead)
f1 <- FastqSampler("file1.fastq", n=1e6)
f2 <- FastqSampler("file2.fastq", n=1e6)

set.seed(123L); p1 <- yield(f1)
set.seed(123L); p2 <- yield(f2)

assuming reads are ordered identically in each file. FastqSampler subsets the file in a single pass, keeping the ith record outright if i <= n, or otherwise keeping it if a random (0, 1] deviate is less than n / i; in the latter case the ith read replaces a randomly chosen read among the n currently saved (actually, the implementation does this on chunks of reads, in line with how R works best).
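
For readers unfamiliar with the technique, here is a minimal Python sketch of the record-at-a-time reservoir scheme described above (the ShortRead implementation works on chunks, as noted; the names here are illustrative):

import random

def reservoir_sample(records, n, seed=123):
    # Single-pass reservoir sampling: the first n records fill the
    # reservoir; afterwards, record i is kept with probability n/i,
    # replacing a randomly chosen current member.
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records, start=1):
        if i <= n:
            reservoir.append(rec)
        elif rng.random() < float(n) / i:
            reservoir[rng.randrange(n)] = rec
    return reservoir

Running the same seeded pass over each mate file, as the set.seed() calls above do, makes the same choices in both files and therefore yields matching pairs.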

-- modified 6.3 years ago • written 6.3 years ago by Martin Morgan
7 votes • Aaronquinlan (United States) wrote:

In the past, I've used the following paste/awk trick to combine each pair into a single line, assign a random number to each pair, sort by the random number, and then choose the first N pairs you wish.

One command line, but broken up with comments for clarity.

# Starting FASTQ files
export FQ1=1.fq
export FQ2=2.fq

# The names of the random subsets you wish to create
export FQ1SUBSET=1.rand.fq
export FQ2SUBSET=2.rand.fq

# How many random pairs do we want?
export N=100

# paste the two FASTQ such that the 
# header, seqs, seps, and quals occur "next" to one another
  paste $FQ1 $FQ2 | \
# "linearize" the two mates into a single record.  Add a random number to the front of each line
  awk 'BEGIN{srand()}; {OFS="\t"; \
                        getline seqs; getline sep; getline quals; \
                        print rand(),$0,seqs,sep,quals}' | \
# sort by the random number
  sort -k1,1 | \
# grab the first N records
  head -n $N | \
# Convert the stream back to 2 separate FASTQ files
# (tab as the field separator, so spaces in headers are preserved).
  awk -F '\t' '{OFS="\n"; \
        print $2,$4,$6,$8 >> ENVIRON["FQ1SUBSET"]; \
        print $3,$5,$7,$9 >> ENVIRON["FQ2SUBSET"]}'
-- modified 6.3 years ago • written 6.3 years ago by Aaronquinlan

Oops, I've just realized that my answer was very close to yours.

-- written 6.3 years ago by Pierre Lindenbaum

Thanks Aaron!!!

-- written 4.0 years ago by Michele Busby
7 votes • brentp (Salt Lake City, UT) wrote:

I had some code to do this for single end here: https://github.com/brentp/bio-playground/blob/master/fileindex/examples/bench.py

You can easily make that work for paired end by iterating over fq1 and fq2 in parallel. Actually, I edited it so it will do as you suggest, and pasted it below.

You call it like:

$ python get_subset.py /path/to/reads_1.fastq /path/to/reads_2.fastq 100

and it will create new files with the same name as your paired files, but ending in .subset and containing 100 random records.

import random
import sys

def write_random_records(fqa, fqb, N=100000):
    """ get N random headers from a fastq file without reading the
    whole thing into memory"""
    records = sum(1 for _ in open(fqa)) / 4
    rand_records = sorted([random.randint(0, records - 1) for _ in xrange(N)])

    fha, fhb = open(fqa),  open(fqb)
    suba, subb = open(fqa + ".subset", "w"), open(fqb + ".subset", "w")
    rec_no = 0  # number of records consumed so far
    for rr in rand_records:

        while rec_no < rr:
            rec_no += 1       
            for i in range(4): fha.readline()
            for i in range(4): fhb.readline()
        for i in range(4):
            suba.write(fha.readline())
            subb.write(fhb.readline())
        rec_no += 1 # (thanks @anderwo)

    print >>sys.stderr, "wrote to %s, %s" % (suba.name, subb.name)

if __name__ == "__main__":
    N = 100 if len(sys.argv) < 4 else int(sys.argv[3])
    write_random_records(sys.argv[1], sys.argv[2], N)
-- modified 4.7 years ago by David Quigley • written 6.3 years ago by brentp

I think rand_records = sorted(random.sample(xrange(records), int(N))) would do the job.

-- written 5.3 years ago by Mitchell

This bit is slick. Very nice.

records = sum(1 for _ in open(fqa)) / 4
rand_records = sorted([random.randint(0, records) for _ in xrange(N)])

-- written 6.3 years ago by Aaronquinlan

Thanks. It works nicely if you're not worried about leaving open file handles (which I am not).

-- written 6.3 years ago by brentp

Thanks for the script. I'm pretty sure the following lines should be altered, otherwise it returns the wrong number of reads:

1) rand_records = sorted([random.randint(1, records) for _ in xrange(N)])

2) rec_no = 0

3) add rec_no += 1 after the block:

for i in range(4):
    suba.write(fha.readline())
    subb.write(fhb.readline())

-- written 6.2 years ago by aunderwo

@aunderwo, I changed it to do random.randint(0, records - 1), I think everything else works as it should.

-- written 6.2 years ago by brentp

@brentp, you didn't comment on 2) from @aunderwo. I also agree that rec_no should be initialized to 0 instead of -1. Right?

-- written 3.7 years ago by Carlos Borroto

@brentp Thanks for the update. I still think you need the rec_no += 1 after the for i in range(4): block, otherwise over time it returns fewer records than specified. I have tested this with a range of N values.

-- written 6.2 years ago by aunderwo

@anderwo Thanks. You are correct. Updated. BTW, I have been using this: https://github.com/brentp/bio-playground/blob/master/reads-utils/select-random-pairs.py which does everything correctly (now that I just adjusted the upper bound to randint). It even checks to make sure it writes the correct number of records.

-- written 6.2 years ago by brentp

@brentp: thanks for this. One potential issue when generating the list of record indices is that the same index can appear multiple times. Something like this would enforce uniqueness:

rand_records = []
while len(rand_records) < N:
    r = random.randint(0, records - 1)
    if r not in rand_records:
        rand_records.append(r)
rand_records = sorted(rand_records)

but isn't very elegant :(

-- written 5.5 years ago by Peter Briggs
3 votes • Ketil (Germany) wrote:

Thanks for all the answers. Being bored, I also wrote my own program for this (http://malde.org/~ketil/biohaskell/biolib/examples/RSelectPE.hs), but the examples above illustrate some nice techniques for achieving this with other tools, so I hope you'll forgive me for also attacking the problem myself.

Most solutions shuffle the files and take the head; I expect that brentp's approach of sorting only the chosen indices (if I understand correctly) will be more efficient, even if Python is likely to have a slower sort/random than the Unix tools (shuf and sort).

I also like the trick by Martin Morgan of running a random sampling routine twice with the same initial seed. In my own implementation, I also select each read based on a random value, but I work on pairs of files simultaneously, which requires more plumbing.

Again, thanks to everybody who contributed!

-- written 6.3 years ago by Ketil

+1 for the open source program

-- written 6.3 years ago by Pierre Lindenbaum
1 vote • chris.mit7 (United States) wrote:

Bit old, but here's my solution

import random
f = open('CD4CD45RO_GATCAG_s_1_1_sequence.txt')
f2 = open('CD4CD45RO_GATCAG_s_1_2_sequence.txt')
o = open(f.name + '.sample', 'wb')
o2 = open(f2.name + '.sample', 'wb')
row = f.readline()
sampleSize = 1000000
tSize = 66000000
f2pos = []
p = float(sampleSize) / float(tSize)
# First pass: sample records from file 1, remembering the byte offset of
# each selected header (assumes mates start at identical offsets in both files).
while row:
    if row[0] == '@':
        if random.random() < p:
            f2pos.append(f.tell() - len(row))
            o.write(row)
            for i in xrange(3):
                o.write(f.readline())
    row = f.readline()
# Second pass: seek to the saved offsets in file 2 and copy the mates.
for pos in f2pos:
    f2.seek(pos)
    for i in xrange(4):
        o2.write(f2.readline())

I'm not sure how long the others take; this one is done in ~10 minutes on 66 million reads.

-- modified 2.8 years ago • written 5.2 years ago by chris.mit7
1 vote • Alex Reynolds (Seattle, WA USA) wrote:

I wrote a tool in C called sample that was designed to get past memory limitations in GNU shuf, which can choke on shuffling inputs of the scale dealt with in genomic studies.

The sample program stores an eight-byte integer for every location of a per-line or per-multiple-line element in a text-formatted input file. These locations are shuffled and a uniformly-random sample is taken of the desired sample size.

For a, say, 4 GiB computer that can allocate 3.5 GiB of memory to sample, up to 470M four-line FASTQ records can be sampled without replacement (235M, with replacement):

$ sample --lines-per-offset=4 --sample-size=${N} reads.fastq > sample.fastq

If the FASTQ records are paired, paired samples can be derived by making a FASTQ file where one record is followed immediately by its pair, and then using sample with --lines-per-offset=8.

The increased lines-per-offset parameter allows sampling (without replacement) a paired input file containing up to 940M paired records (470M, with replacement) on a 4 GiB system allocating 3.5 GiB to sample:

$ sample --lines-per-offset=8 --sample-size=${N} paired_reads.fastq > paired_sample.fastq

The file paired_reads.fastq can be created by linearizing the two FASTQ files, using paste to interleave them with a newline delimiter, and then delinearizing the resulting sample.
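
As a sketch, that interleaving step could also be done in Python rather than with paste (the file names are placeholders):

from itertools import islice

def interleave_fastq(fq1, fq2, out):
    # Write one 4-line record from each file alternately, so each
    # pair occupies 8 consecutive lines of the output.
    with open(fq1) as a, open(fq2) as b, open(out, "w") as o:
        while True:
            rec1 = list(islice(a, 4))
            rec2 = list(islice(b, 4))
            if not rec1 or not rec2:
                break
            o.writelines(rec1)
            o.writelines(rec2)

interleave_fastq("reads_1.fastq", "reads_2.fastq", "paired_reads.fastq")

Splitting the sampled output back into two files (every first record to one file, every second record to the other) reverses the process.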

-- modified 2.9 years ago • written 2.9 years ago by Alex Reynolds
0 votes • Paquola wrote:

I just saw in the man page that shuf has this option:

   -n, --head-count=COUNT
          output at most COUNT lines

This is potentially much more efficient than permuting the whole file and then taking the head.

But never mind... I tested it, and shuf puts the whole thing in memory, so it's not really different from permuting and then taking the head.

-- modified 6.1 years ago • written 6.1 years ago by Paquola
0 votes • mondiraray (United States) wrote:

http://seqanswers.com/forums/showthread.php?t=12070

(I used the HTSeq tool, as recommended by Simon.)

I also ran out of memory when I tried Aaron's, Pierre's, and Martin's recommendations. Hope this helps anyone else having memory issues. Thanks.

-- written 2.9 years ago by mondiraray
0 votes • plijnzaad (European Union) wrote:

Dunno if it's still relevant, but I remember having written a tool for this that uses one-pass rejection sampling; see https://github.com/plijnzaad/phtools/blob/master/ngs/fastq-subsample-paired.pl
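
For reference, the same one-pass idea in a minimal Python sketch: read both mate files in lockstep and make a single keep-or-drop decision per pair (an illustration of the approach, not a translation of the linked Perl script; names are placeholders):

import random

def subsample_pairs(fq1, fq2, out1, out2, fraction, seed=42):
    # One 4-line record is read from each file per iteration and a
    # single decision is applied to the pair, so mates stay together.
    rng = random.Random(seed)
    with open(fq1) as a, open(fq2) as b, \
         open(out1, "w") as oa, open(out2, "w") as ob:
        while True:
            rec1 = [a.readline() for _ in range(4)]
            rec2 = [b.readline() for _ in range(4)]
            if not rec1[0] or not rec2[0]:
                break
            if rng.random() < fraction:
                oa.writelines(rec1)
                ob.writelines(rec2)

subsample_pairs("reads_1.fastq", "reads_2.fastq",
                "reads_1.subset.fastq", "reads_2.subset.fastq", 0.1)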

-- written 14 months ago by plijnzaad