Is there any method to randomize the read order in a multi-Gbp fastq file?
Assuming you are talking about a single-end file, you can use awk to put each 4-line fastq entry on a single line. You then use GNU shuf, sort -R (later versions of sort; if not available, go to GNU Utils), or my shuffle tool in the filo package. The output will be a shuffled stream of one-line per fastq entry, so you will need to use awk once more to make a 4-line-per-entry file. Below should work.
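# Join each 4-line record into one tab-separated line, shuffle the lines,
# then split on tabs only (read headers may contain spaces).
awk '{OFS="\t"; getline seq; \
getline sep; \
getline qual; \
print $0,seq,sep,qual}' reads.fq | \
sort -R | \
awk -F'\t' '{OFS="\n"; print $1,$2,$3,$4}' \
> reads.shuffled.fq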
You could extend this example for paired-end fastq by reading in two files at once with awk.
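The paired-end extension mentioned above could look roughly like the sketch below. It assumes two uncompressed files with mates in the same order and the same number of records; the function name `shuffle_pairs` and all file names are hypothetical. The first awk reads one record from the first file and pulls the matching record from the second file with `getline < file`, so each pair travels through `shuf` as a single 8-column line.

```shell
# shuffle_pairs R1 R2 OUT1 OUT2
# Hypothetical helper: shuffles two uncompressed FASTQ files together so
# that mates end up at the same position in both output files.
shuffle_pairs() {
    awk -v mate="$2" 'BEGIN { OFS = "\t" }
    {
        # Record from the first file (the current line is its header).
        h1 = $0; getline s1; getline p1; getline q1
        # Matching record from the second file.
        getline h2 < mate; getline s2 < mate
        getline p2 < mate; getline q2 < mate
        print h1, s1, p1, q1, h2, s2, p2, q2
    }' "$1" |
    shuf |
    awk -F'\t' -v out1="$3" -v out2="$4" 'BEGIN { OFS = "\n" }
    {
        # Columns 1-4 are mate 1, columns 5-8 are mate 2.
        print $1, $2, $3, $4 > out1
        print $5, $6, $7, $8 > out2
    }'
}
```

It would be called as `shuffle_pairs reads_1.fq reads_2.fq reads_1.shuffled.fq reads_2.shuffled.fq`. Splitting on tabs only in the last step keeps headers that contain spaces intact.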
I've been playing with Python, trying to solve paired-end FastQ order randomisation.
After a very unsuccessful afternoon, I decided to try BASH. The BASH-based solution is simple and efficient:
paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
I know this post is a decade old, but I couldn’t get either solution to run. I have paired-end reads, and when I tried feeding both reads in Aaronquinlan’s solution, I just get empty files. I’m certain I haven’t altered the code correctly to accept paired-end read files.
When I tried Leszek’s solution (who also struggled to adapt Aaronquinlan’s for PE reads), it “thinks” for a while and ultimately just returns the command prompt without generating any files (or errors).
Googling suggests that many people split the files multiple times and recombine them (I think?) but this is not practical for me. I have concatenated about 2,000 reads onto 39M reads. I just want to shuffle them. Any suggestions for how to do this? I feel like this should be possible with a bash (especially awk + shuf) solution and I want to get that working. Thanks!
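When a shuffle pipeline returns without producing usable output, one thing that may be worth ruling out first is malformed input, for example a concatenation that left a partial record or mate files with different record counts. The sketch below is only a sanity check under that assumption, not a diagnosis of the problem described above; the helper name `fastq_records` is hypothetical.

```shell
# fastq_records FILE
# Hypothetical helper: prints the number of 4-line records in FILE
# (plain text or gzip-compressed, judged by the .gz suffix) and returns
# non-zero if the line count is not a multiple of 4.
fastq_records() {
    case "$1" in
        *.gz) lines=$(gzip -cd "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    lines=$((lines))    # drop the padding that some wc versions add
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$1: $lines lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

For paired-end data, comparing `fastq_records test.1.fq.gz` with `fastq_records test.2.fq.gz` before shuffling shows whether the two files still have the same number of records.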
shuf is listed in the answer. I find that systems that lack sort -R also lack shuf.
I did a quick benchmark and found that shuf was MUCH faster than sort -R in my environment (linux). I canceled the sort -R after 5 minutes or so...
I actually created an account just to comment and thank Brad Langhorst. shuf is MUCH MUCH faster than sort -R (Ubuntu 16.04). I had to shuffle 10 million reads, and after 15-20 minutes I stopped the script containing sort -R, changed that into shuf and it was done in about 40 seconds (probably even a bit less).
thanks, that's a neat solution. the farm I am using doesn't have sort -R or shuf, so I'll try and see if a more modern version can be locally installed.
I see. I also have a "shuffle" program in the filo package. The downside of that tool is that it reads all of the records into memory.
if sort -R isn't available, you can try the 'shuf' command
I locally installed a modern coreutils following instructions here
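For a machine that has neither sort -R nor shuf, and where installing a newer coreutils is not an option, a common workaround is to prefix every line with a random key, sort on that key, and strip it again. The sketch below is one way to do that with only awk, sort, and cut; the function name `shuffle_lines` and its optional seed argument are hypothetical. It expects one record per line on standard input, so it would take the place of `sort -R` or `shuf` in the pipelines above. Note that this still performs a full sort, so it is not expected to be as fast as shuf.

```shell
# shuffle_lines [SEED]
# Hypothetical helper: shuffles lines from stdin without sort -R or shuf.
# With a SEED the order is reproducible for a given awk; without one,
# awk seeds its generator from the clock.
shuffle_lines() {
    awk -v seed="${1:-}" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    { printf "%.12f\t%s\n", rand(), $0 }    # random key, tab, original line
    ' |
    LC_ALL=C sort -k1,1n |                  # order lines by the random key
    cut -f2-                                # drop the key, keep the rest
}
```

For example, the single-end command could end in `... reads.fq | shuffle_lines | awk -F'\t' '{OFS="\n"; print $1,$2,$3,$4}' > reads.shuffled.fq`. Because awk seeds from the clock at one-second resolution, two runs started within the same second may produce the same order unless different seeds are passed.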