Question

How to randamly extract reads from a FASTQ file?

2

Entering edit mode

8.9 years ago

biolab ★ 1.4k

Hi everyone,

I want to randomly extract some numbers of reads from a fastq file, and save as a new smaller fastq file. Is there any perl script or shell command to achieve this?

Thank you very much!

fastq • 12k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by biolab ★ 1.4k

0

Entering edit mode

srand is a perl function that can be used for this kind of tasks. You can write your own script using this function.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by venu 7.1k

Ram · Answer 1 · 2015-05-16

9

Entering edit mode

8.9 years ago

rtliu ★ 2.2k

How To Obtain And Install Seqtk

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by rtliu ★ 2.2k

0

Entering edit mode

Thank you all for your helpful comments!

ADD REPLY • link 8.9 years ago by biolab ★ 1.4k

Ram · Answer 2 · 2015-05-18

2

Entering edit mode

8.9 years ago

5heikki 11k

grep "^@HWI" file.fq | shuf -n 10 | grep -A 3 -x -f - file.fq | grep -v "^--$"

Fetch all lines that begin with "$HWI" (you might have to change this), randomly select 10 from these, fetch the selected lines and 3 lines below them (fastq format = header + 3 lines, header + 3 lines, ..), remove "--" lines (artifact of grep).

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by 5heikki 11k

1

Entering edit mode

Hi- I think this is a nice solution. But is it practical? I suspect shuf reads in memory the whole input first. Also, isn't the second grep going to be very slow for, say, a fastq file of 100M reads and a sample of 1M?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by dariober 14k

1

Entering edit mode

It's practical in the sense that you will have a ton of time to browse biostars while waiting for the command to complete. You're right though, it's very, very slow for large files.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by 5heikki 11k

0

Entering edit mode

My answer in this thread suggests a tool I wrote (sample) that gets around the memory limitations in shuf. If the scale of your input gives you memory errors with shuf, you might look into sample. See: How to randamly extract reads from a FASTQ file?

The sample utility uses the same sampling technique as shuf, but while shuf reads the entire file into memory, sample stores the locations of starting points for lines (or groups of lines) and samples from those locations, instead.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Alex Reynolds 35k

Ram · Answer 3 · 2015-05-16

1

Entering edit mode

8.9 years ago

Brian Bushnell 20k

You can do that with the BBMap package:

reformat.sh in=reads.fq out=subsampled.fq samplerate=0.1

It will keep pairs together. If the pairs are in 2 files, use in1 and in2, and optionally, out1 and out2. It's much faster than a perl script.

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Does # work here for two files? as in in=reads_R#.fq

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by apelin20 ▴ 480

2

Entering edit mode

Yes, you could also write this:

reformat.sh in=read_#.fq out=subsampled_#.fq samplerate=0.1

...which would use 2 input files, read_1.fq and read_2.fq, and generate 2 output files.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Brian Bushnell 20k

Ram · Answer 4 · 2015-05-16

You can sample with or without replacement with sample using a --lines-per-offset value of 4 for FASTQ files:

$ sample --sample-size=${SAMPLE_SIZE} \
         --lines-per-offset=4 \
       [ --sample-without-replacement | --sample-with-replacement ] \
         reads.fq \
       > sample.fq

Add the --preserve-order option, if you want the sample to keep reads in the same order as in the original file.

Ram · Answer 5 · 2015-05-17

1

Entering edit mode

8.9 years ago

dariober 14k

If your input fastq hasn't been sorted in any particular way (i.e. it comes from the sequencer as is), you can probably assume that the reads in it are already in random order. So just get your desired number of reads straight from the fastq file. Just skip the first few hundreds of thousands of reads since these are usually bad quality:

gunzip -c reads.fq.gz \
| tail -n+1000000 \
| head  -n 4000000 \
| gzip > sampled.fq.gz

Here input and output are gzipped, first 250,000 reads skipped, output 1,000,000 reads. For testing purposes this might be good enough, quick and no dependencies.

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by dariober 14k

1

Entering edit mode

The output of the sequencer is not in a random order: it follows the spatial progression of the imaging system on the slides. As you noted the first sequences are already bad quality (because they are from a corner ?). But bad quality areas also happen in other parts of the slide, in a much less predictable manner. So your method is likely to either over-represent low-quality reads if it hits such an area, or to under-represent them if it misses the area.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Sure, I meant random with respect to the genome sequence or sequence content. For critical analyses I wouldn't do what I suggest. As I said, it's a quick and easy way to get a sample of reads for testing, e.g. if a pipeline works smoothly and gives sensible results.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by dariober 14k