I would like to split my paired-end (PE) reads in half and keep the two halves in separate files. Is there a quick way to do this?
Do you want a random sample of 50% of the reads, or a 'down-the-middle' split?
If all you want is the first 50% of the reads in the file, without random sampling, you can: (1) count the number of reads in the FASTQ file: echo $(( $(wc -l < your.fastq) / 4 )); (2) divide the number of reads by 2; (3) multiply this number by 4 to get the number of lines you need; (4) run head -n <lines> your.fastq to get the first 50% of the sequences; and (5) run tail -n +<lines + 1> your.fastq to get the remaining 50%. Do the same for both files of the pair so the mates stay in sync.
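The counting and head/tail steps above can be sketched as a small shell function. This is only a minimal sketch: it assumes uncompressed FASTQ with exactly four lines per read, and the function name and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Split a FASTQ file into its first and second half by read count.
# Assumption: uncompressed FASTQ with exactly 4 lines per read (no wrapped sequences).
split_fastq_in_half() {
    local input=$1 first=$2 second=$3
    local lines reads head_lines
    lines=$(wc -l < "$input")            # total number of lines in the file
    reads=$(( lines / 4 ))               # 4 lines per read
    head_lines=$(( reads / 2 * 4 ))      # lines that cover the first half of the reads
    head -n "$head_lines" "$input" > "$first"
    tail -n +"$(( head_lines + 1 ))" "$input" > "$second"
}

# Hypothetical file names; run the same split on both mates so the pairs stay in sync.
# split_fastq_in_half read1.fastq read1.first.fastq read1.second.fastq
# split_fastq_in_half read2.fastq read2.first.fastq read2.second.fastq
```

With an odd number of reads, the second file receives the extra read. Because both mate files have the same number of reads, applying the function to each of them cuts at the same pair.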
You can use the sample command from seqtk. Use the same seed (-s) for both files so that the pairs stay matched.
If you know Python, you can use HTSeq for subsampling, but to get the other half you would have to follow @genomax2's suggestion of finding, by header, the reads that didn't end up in your subsampled files. Here's the example on seqanswers.
You can use reformat.sh from BBMap:
reformat.sh in1=read1.fq.gz in2=read2.fq.gz out1=new1.fq.gz out2=new2.fq.gz samplerate=0.5
Thanks for your reply.
I guess it extracts the reads randomly. Now, how do I extract the remaining 50%?
I think you will need to grab the IDs of the reads that were selected in the first round and then use filterbyname.sh from BBMap to write the rest to separate files.
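As an alternative to sampling first and filtering by name afterwards, both halves could be written in a single pass. The sketch below does not use BBMap; it is a plain paste/awk approach with a hypothetical function name and output naming scheme. It assumes uncompressed four-line FASTQ, mates in the same order in both files, and no tab characters in the headers. Each pair is assigned to set A or B with probability 0.5, so the two sets are only approximately equal in size.

```shell
#!/usr/bin/env bash
# Randomly assign each read pair to one of two complementary output sets (A or B).
# Assumptions: uncompressed 4-line FASTQ, mates in identical order, no tabs in headers.
random_split_pairs() {
    local r1=$1 r2=$2 prefix=$3 seed=${4:-42}
    # paste puts line N of R1 and line N of R2 side by side, separated by a tab
    paste "$r1" "$r2" | awk -F '\t' -v seed="$seed" -v prefix="$prefix" '
        BEGIN { srand(seed) }
        # decide once per record, on the header line; the next 3 lines follow it
        NR % 4 == 1 { set = (rand() < 0.5) ? "A" : "B" }
        {
            print $1 > (prefix "." set ".R1.fastq")
            print $2 > (prefix "." set ".R2.fastq")
        }'
}

# Hypothetical usage: writes half.A.R1.fastq, half.A.R2.fastq, half.B.R1.fastq, half.B.R2.fastq
# random_split_pairs read1.fastq read2.fastq half 100
```

Every pair lands in exactly one of the two sets, so the B files are the complement of the A files by construction. For gzipped input, the files would need to be decompressed first (for example with zcat).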
A few options: Selecting Random Pairs From Fastq?