Subsampling with BBtools by read number
2.1 years ago

I already know that reformat.sh can be used to subsample, but it does so by percentage. Is there a way to subsample a fixed number of reads instead? I know the reads=# flag exists, but it seems to limit how many reads are processed before the subsampling happens. If worst comes to worst, I can randomly subsample 99% of the file and then run a second round of reformat using reads= to get my fixed number, but a single step would be much easier.

bbtools subsampling

As long as you set sampleseed, the sampling should be deterministic but random.

Have you tried using a combination of that or samplerate along with reads?

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).
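For intuition, exact-count subsampling (what the srt/sbt flags promise) can be mimicked on an uncompressed FASTQ with plain coreutils: flatten each 4-line record onto one line, shuffle, take N, and restore the layout. This is only a sketch using GNU `shuf` and a toy file created here, not how BBTools implements it:

```shell
# Toy file with 5 FASTQ records (printf reuses the format per argument)
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 4 5 > demo.fq

# Flatten 4-line records, randomly pick exactly 3, restore the layout
paste -d'\t' - - - - < demo.fq | shuf -n 3 | tr '\t' '\n' > sub.fq

# Read count = line count / 4
echo $(( $(wc -l < sub.fq) / 4 ))   # 3
```

Unlike a per-read sampling rate, this always yields exactly the requested number of reads.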


So the deterministic part doesn't matter. My understanding is that reads= refers to how many reads are processed. So reads=10 samplerate=0.9 wouldn't give you ~9 random reads from the whole file, but rather ~9 of the first 10 reads, which is still the same data. Maybe I'm wrong?
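That concern can be sketched with a toy pipeline (an analogy, not BBTools itself): if reads= caps the input before samplerate= flips its coin, every surviving read necessarily comes from the head of the file.

```shell
# Pretend reads are 1..1000000; "reads=10" keeps only the first 10,
# then "samplerate=0.9" keeps each survivor with probability 0.9
seq 1 1000000 | head -n 10 | awk 'BEGIN{srand(13)} rand() < 0.9' > kept.txt

# The kept set is random, but drawn only from the first 10 reads
sort -n kept.txt | tail -n 1
```

The printed maximum can never exceed 10, which is the commenter's point: the combination does not sample across the whole file.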


Did my suggestion work? Or were you able to find parameters that do?

15 months ago

The reformat.sh option samplereadstarget seems to be exactly what you want.

To subsample 1M reads from a file, I ran:

reformat.sh \
in=input.fq \
out=output.fq \
samplereadstarget=1000000 \
sampleseed=13


To check:

wc -l output.fq


4000000 lines -- as expected: 1M reads × 4 lines per FASTQ record.

Additionally, to check that the reads were random and not just the top 1M reads from the file, I checked the first few read names and they seem different:

grep -m 5 "^@" input.fq


@NB501138:242:H7WV2BGXB:1:11101:20526:1063 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:4613:1064 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:10306:1066 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:1896:1067 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:16712:1068 1:N:0:AAGTTGAT

grep -m 5 "^@" output.fq


@NB501138:242:H7WV2BGXB:1:11101:25676:1087 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:9954:1149 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:22627:1297 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:13619:2023 1:N:0:AAGTTGAT
@NB501138:242:H7WV2BGXB:1:11101:1974:2172 1:N:0:AAGTTGAT
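One caveat with grep -m 5 "^@": FASTQ quality strings can also begin with '@', so grepping for headers can pick up quality lines and overcount. Counting every fourth line is safer. A small sketch with a deliberately tricky toy file created here:

```shell
# Second record's quality line starts with '@' on purpose
printf '@r1\nACGT\n+\n@III\n@r2\nTTTT\n+\nIIII\n' > demo.fq

grep -c "^@" demo.fq                # 3 -- overcounts, '@III' matches too
awk 'NR % 4 == 1' demo.fq | wc -l   # 2 -- one header per 4-line record
```

For a quick visual spot-check like the one above this rarely matters, but for an actual read count, the position-based approach is the reliable one.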