Reduce the number of PE reads by half
2
0
5.2 years ago
BioGeek ▴ 150

I would like to reduce the number of PE reads by half, keeping both halves in two separate sets of files. Is there a quick way to do this?

Assembly NGS PE Reads • 978 views
0

Is this a random sampling of 50% of the reads, or a 'down-the-middle' split?

1

If all you want is the first 50% of the reads in the file, without random sampling, you can: (1) count the number of reads in the FASTQ (4 lines per read): echo $(( $(wc -l < your.fastq) / 4 )); (2) divide the number of reads by 2, rounding down; (3) multiply this number by 4 to get the number of lines you need; (4) use head -n <lines> to get the first 50% of the reads; (5) use tail -n +<lines + 1> to get the remaining reads. Do the same for both files of the pair, with the same line count, so the mates stay in sync.
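The steps above can be sketched as a small bash function. This is only an illustration: the function name split_fastq_in_half and the file names are made up, and it assumes uncompressed FASTQ with strict 4-line records (no wrapped sequence lines).

```shell
#!/usr/bin/env bash
# Split a FASTQ file into its first and second halves (no random sampling).
# Assumption: uncompressed input, exactly 4 lines per read.
# split_fastq_in_half is a hypothetical helper name, not part of any tool.
split_fastq_in_half() {
    local in="$1" first="$2" second="$3"
    local lines reads half_lines

    lines=$(wc -l < "$in")            # total lines in the file
    reads=$(( lines / 4 ))            # 4 lines per read
    half_lines=$(( reads / 2 * 4 ))   # lines in the first half (rounded down to whole reads)

    # First half: the top half_lines lines.
    head -n "$half_lines" "$in" > "$first"
    # Second half: everything from line half_lines + 1 onwards.
    tail -n +"$(( half_lines + 1 ))" "$in" > "$second"
}

# Example usage for a pair (file names are placeholders):
#   split_fastq_in_half read1.fq half1_R1.fq half2_R1.fq
#   split_fastq_in_half read2.fq half1_R2.fq half2_R2.fq
```

Because both files of a pair list the mates in the same order and have the same number of reads, running the function on each file gives halves that stay paired. With an odd number of reads, the extra read goes to the second half.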

0

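You can use the seqtk sample command, e.g. seqtk sample -s100 read1.fq 0.5 > sub1.fq, then the same with read2.fq. Use the same seed (-s) for both files so the pairs stay matched.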

0

If you know Python, you can use HTSeq for subsampling, but to get the other half you would have to follow @genomax2's suggestion of finding, by header, the reads that didn't end up in your subsampled files. Here's an example on SEQanswers.

3
5.2 years ago
GenoMax 107k

reformat.sh from BBMap.

reformat.sh in1=read1.fq.gz in2=read2.fq.gz out1=new1.fq.gz out2=new2.fq.gz samplerate=0.5

0

Thanks for your reply. I guess it extracts the reads randomly. Now, how do I extract the remaining 50%?

1

I think you will need to grab the IDs of the reads that got selected in the first round and then use filterbyname.sh from BBMap to get the rest in separate files.
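For illustration, the same "everything that was not selected" idea can be sketched with plain awk instead of filterbyname.sh. This is not how BBMap does it, just a minimal stand-in: the function name fastq_complement and the file names are made up, and it assumes uncompressed FASTQ with 4-line records and read IDs that end at the first whitespace of the header line.

```shell
#!/usr/bin/env bash
# Print the records of a full FASTQ whose read ID is absent from a subsample.
# Assumption: uncompressed input, exactly 4 lines per read.
# fastq_complement is a hypothetical helper name, not a BBMap command.
fastq_complement() {
    local full="$1" sub="$2"
    awk '
        # First file (the subsample): remember the ID on each header line.
        FILENAME == ARGV[1] { if (FNR % 4 == 1) seen[$1] = 1; next }
        # Second file (the full set): at each header, decide whether to keep the record.
        FNR % 4 == 1 { keep = !($1 in seen) }
        # Print all 4 lines of every record being kept.
        keep
    ' "$sub" "$full"
}

# Example usage after the reformat.sh run, on decompressed files (names are placeholders):
#   fastq_complement read1.fq new1.fq > rest1.fq
#   fastq_complement read2.fq new2.fq > rest2.fq
```

Headers are recognised by line position rather than by a leading @, since quality lines can also start with @. The list of sampled IDs is held in memory, so very large subsamples need a correspondingly large amount of RAM.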

1
5.2 years ago
igor 12k

A few options: Selecting Random Pairs From Fastq?