Randomize Read Order In Multi-Gbp Fastq File?
12.8 years ago

Is there any method to randomize the read order in a multi-Gbp fastq file?

fastq • 7.7k views

I know this post is a decade old, but I couldn't get either solution to run. I have paired-end reads, and when I tried feeding both read files into Aaronquinlan's solution, I just got empty files. I'm sure I haven't adapted the code correctly to accept paired-end read files.

When I tried Leszek's solution (Leszek also struggled to adapt Aaronquinlan's version for PE reads), it "thinks" for a while and ultimately just returns to the command prompt without generating any files (or errors).

Googling suggests that many people split the files multiple times and recombine them (I think?), but that is not practical for me. I have concatenated about 2,000 reads onto 39M reads, and I just want to shuffle them. Any suggestions for how to do this? I feel like this should be possible with a bash solution (especially awk + shuf), and I want to get that working. Thanks!

12.8 years ago

Assuming you are talking about a single-end file, you can use awk to put each 4-line fastq entry on a single line. You can then shuffle with GNU shuf, with sort -R (available in later versions of GNU sort; if you don't have it, install a recent GNU coreutils), or with my shuffle tool in the filo package. The output will be a shuffled stream with one line per fastq entry, so you need one more awk pass to restore the 4-line-per-entry format. The example below should work.

awk '{OFS="\t"; getline seq; \
                getline sep; \
                getline qual; \
                print $0,seq,sep,qual}' reads.fq | \
sort -R | \
awk '{OFS="\n"; print $1,$2,$3,$4}' \
> reads.shuffled.fq

You could extend this example for paired-end fastq by reading in two files at once with awk.
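
For instance, a rough sketch of that paired-end extension (an illustration, not part of the original answer; reads_1.fq and reads_2.fq are placeholder names, and the two files must list mates in the same order):

# Sketch only: read the mate file inside awk with getline, keep both mates on
# one line, shuffle, then split the records back into two files.
awk -v OFS='\t' '{h1=$0; getline s1; getline p1; getline q1; \
                  getline h2 < "reads_2.fq"; getline s2 < "reads_2.fq"; \
                  getline p2 < "reads_2.fq"; getline q2 < "reads_2.fq"; \
                  print h1,s1,p1,q1,h2,s2,p2,q2}' reads_1.fq | \
shuf | \
awk -F'\t' -v OFS='\n' '{print $1,$2,$3,$4 > "reads_1.shuffled.fq"; \
                         print $5,$6,$7,$8 > "reads_2.shuffled.fq"}'

shuf can be swapped for sort -R here as well; Leszek's paste-based answer further down does the same job without the second input stream in awk.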


shuf is listed in the answer. I find that systems that lack sort -R also lack shuf.


I did a quick benchmark and found that shuf was MUCH faster than sort -R in my environment (Linux). I canceled the sort -R after 5 minutes or so...

langhorst@seq02-i:~$ time shuf /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null

real    0m4.552s
user    0m4.220s
sys     0m0.330s

langhorst@seq02-i:~$ time sort -R /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null
^C

real    5m23.051s
user    5m22.530s
sys     0m0.500s

I actually created an account just to comment and thank Brad Langhorst.

shuf is MUCH MUCH faster than sort -R (Ubuntu 16.04). I had to shuffle 10 million reads, and after 15-20 minutes I stopped the script that used sort -R, switched it to shuf, and it finished in about 40 seconds (probably even a bit less).


Thanks, that's a neat solution. The farm I am using doesn't have sort -R or shuf, so I'll try to see if a more modern version can be installed locally.


I see. I also have a "shuffle" program in the filo package. The downside of that tool is that it reads all of the records into memory.


If sort -R isn't available, you can try the shuf command.


I locally installed a modern coreutils following instructions here
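
A generic local build of GNU coreutils looks something like this (the version number and install prefix are placeholders, not taken from the linked instructions):

wget https://ftp.gnu.org/gnu/coreutils/coreutils-9.5.tar.xz
tar xf coreutils-9.5.tar.xz
cd coreutils-9.5
./configure --prefix="$HOME/local"
make && make install
export PATH="$HOME/local/bin:$PATH"   # pick up the newly built sort and shuf first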

9.5 years ago
Leszek

I've been playing with Python trying to solve the paired-end FastQ order randomisation.

After a very unsuccessful afternoon, I decided to try Bash. The Bash-based solution is simple and efficient:

paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | \
paste - - - - | \
shuf | \
awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
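
If you also want compressed output, one option (an addition here, not part of the original one-liner) is to let awk pipe each stream through gzip:

# Variation (not from the original post): compress the outputs by piping each
# stream through gzip inside awk.
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | \
awk -F'\t' '{OFS="\n"; \
             print $1,$3,$5,$7 | "gzip > random.1.fq.gz"; \
             print $2,$4,$6,$8 | "gzip > random.2.fq.gz"}'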

5 months ago
Artyom

The most voted answer fails when headers contain spaces (which they might), so my solution uses paste, shuf, tr, and awk instead; a sketch of that kind of pipeline is below.
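
This is a reconstruction with placeholder file names rather than Artyom's exact command; the idea is to split records only on tabs, so spaces inside headers pass through untouched:

# Single-end sketch: 4 lines -> 1 tab-separated line, shuffle, back to 4 lines.
paste - - - - < reads.fq | shuf | tr '\t' '\n' > reads.shuffled.fq

For paired-end data the same trick combines with the tab-aware awk split shown in Leszek's answer above.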

5 months ago

BBTools' shuffle2.sh can randomize fastq files of arbitrary size, including paired twin files, keeping pairs together. shuffle.sh is there too, but it requires the input to fit in memory; shuffle2.sh writes temp files when the data won't fit in memory. It handles multi-line fasta files as well.
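
A minimal paired-end invocation (file names here are placeholders; the in=/in2=/out=/out2= parameter style matches the command shown in the reply below) would be something like:

shuffle2.sh in=reads_1.fq.gz in2=reads_2.fq.gz out=shuffled_1.fq.gz out2=shuffled_2.fq.gz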


Hi Brian, I've tried your solution, but there is an issue I couldn't figure out. I gave it paired-end fastq.gz files as input, and it returned two outputs; however, one is 70GB and the other is empty. Here is the command I used:

shuffle2.sh -eoom -da -Xmx100G seed=123 ziplevel=2 in=test_R1_001.fastq.gz in2=test_R2_001.fastq.gz out=bbtest_r1.fq.gz out2=bbtest_r2.fq.gz

Looks like a bug, I'll investigate. What is your input file size?


One is 37G, the other is 40G.
