Randomize Read Order In Multi-Gbp Fastq File?
11.8 years ago

Is there any method to randomize the read order in a multi-Gbp fastq file?

fastq • 6.2k views
11.8 years ago

Assuming you are talking about a single-end file, you can use awk to put each 4-line FASTQ entry on a single line. You can then shuffle those lines with GNU shuf, sort -R (available in later versions of GNU sort; if yours lacks it, install GNU coreutils), or my shuffle tool in the filo package. The output is a shuffled stream with one line per FASTQ entry, so you need awk once more to restore the 4-line-per-entry format. The following should work.

    # Join each 4-line record into one tab-separated line, shuffle the lines,
    # then split each line back into 4 lines.
    awk '{OFS="\t"; getline seq; \
          getline sep; \
          getline qual; \
          print $0,seq,sep,qual}' reads.fq | \
    sort -R | \
    awk -F'\t' '{OFS="\n"; print $1,$2,$3,$4}' \
    > reads.shuffled.fq

You could extend this example for paired-end fastq by reading in two files at once with awk.

shuf is listed in the answer. I find that systems that lack sort -R also lack shuf.

I did a quick benchmark and found that shuf was MUCH faster than sort -R in my environment (Linux). I canceled the sort -R after 5 minutes or so...

    langhorst@seq02-i:~$ time shuf /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null

    real    0m4.552s
    user    0m4.220s
    sys     0m0.330s

    langhorst@seq02-i:~$ time sort -R /galaxy/galaxy-dist/database/files/020/dataset_20337.dat > /dev/null
    ^C
    real    5m23.051s
    user    5m22.530s
    sys     0m0.500s

I actually created an account just to comment and thank Brad Langhorst. shuf is MUCH MUCH faster than sort -R (Ubuntu 16.04). I had to shuffle 10 million reads, and after 15-20 minutes I stopped the script containing sort -R, changed it to shuf, and it was done in about 40 seconds (probably even a bit less).

Thanks, that's a neat solution. The farm I am using doesn't have sort -R or shuf, so I'll try and see if a more modern version can be locally installed.

I see. I also have a "shuffle" program in the filo package. The downside of that tool is that it reads all of the records into memory.

If sort -R isn't available, you can try the 'shuf' command.

I locally installed a modern coreutils following instructions here

8.5 years ago
Leszek

I've been playing with Python, trying to solve paired-end FASTQ order randomisation. After a very unsuccessful afternoon, I decided to try Bash. The Bash-based solution is simple and efficient:

    paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
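For anyone adapting the paired-end pipeline, here is a minimal sketch of the same idea wrapped in a function. The function name `shuffle_pairs` and all file names are hypothetical. It assumes uncompressed input, exactly four lines per record, no tab characters inside any line, and GNU shuf on the PATH. As far as I know, shuf holds its whole input in memory, so very large files need enough RAM.

```shell
# Shuffle two paired FASTQ files while keeping mates on the same row.
# Hypothetical helper; arguments: in1 in2 out1 out2 (uncompressed files).
shuffle_pairs() {
    in1=$1; in2=$2; out1=$3; out2=$4
    paste "$in1" "$in2" |      # line i of file 1 <TAB> line i of file 2
        paste - - - - |        # 4 consecutive lines -> one 8-field row per read pair
        shuf |                 # permute whole read pairs
        awk -F'\t' -v o1="$out1" -v o2="$out2" '
            BEGIN { OFS = "\n" }
            {
                print $1, $3, $5, $7 > o1   # odd fields belong to mate 1
                print $2, $4, $6, $8 > o2   # even fields belong to mate 2
            }'
}
```

It would be called as `shuffle_pairs reads.1.fq reads.2.fq random.1.fq random.2.fq`. For gzipped input, the `<(zcat ...)` form from the answer can be passed in place of the file names when running under bash.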

8 weeks ago

I know this post is a decade old, but I couldn’t get either solution to run. I have paired-end reads, and when I tried feeding both reads in Aaronquinlan’s solution, I just get empty files. I’m certain I haven’t altered the code correctly to accept paired-end read files.

When I tried Leszek’s solution (who also struggled to adapt Aaronquinlan’s for PE reads), it “thinks” for a while and ultimately just returns to the command prompt without generating any files (or errors).

Googling suggests that many people split the files multiple times and recombine them (I think?) but this is not practical for me. I have concatenated about 2,000 reads onto 39M reads. I just want to shuffle them. Any suggestions for how to do this? I feel like this should be possible with a bash (especially awk + shuf) solution and I want to get that working. Thanks!
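The paste-based approaches in this thread assume that both files have the same number of lines and that this number is a multiple of four, which may be worth confirming after concatenating reads. A minimal sketch of such a pre-flight check follows; the helper name `check_fastq_pair` is hypothetical, and it assumes uncompressed files that end with a newline.

```shell
# Pre-flight check before a paste/shuf shuffle (hypothetical helper).
# Prints the number of records on success; returns non-zero on a mismatch.
check_fastq_pair() {
    n1=$(wc -l < "$1") || return 1
    n2=$(wc -l < "$2") || return 1
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2" >&2
        return 1
    fi
    if [ $((n1 % 4)) -ne 0 ]; then
        echo "line count $n1 is not a multiple of 4" >&2
        return 1
    fi
    echo $((n1 / 4))
}
```

Running `check_fastq_pair reads.1.fq reads.2.fq` (file names are placeholders) prints the record count if both files look consistent.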