Subsampling with BBtools by read number
Entering edit mode
4.2 years ago

I already know that can be used to subsample but this is done by percentage. Is there a method to specify sub-sampling a fixed number of reads? I know the reads=# flag exists but this seems to occur before the subsampling. I know if worst comes to worst I can randomly subsample 99% of the file and then run a second round of reformat using reads= to get my fixed number, but a single step would be much easier.

bbtools subsampling • 3.5k views
Entering edit mode

As long as you set sampleseed the sampling should be deterministic but random.

Have you tried using a combination of that or samplerate along with reads?

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).
Entering edit mode

So the deterministic part doesn't matter. My undertsanding is reads refers to how many are processed. So if you did reads=10 samplerate=0.9 it wouldn't give you 9 random reads from the file, but rather the 9 of the first 10 reads randomized which is still the same data. Maybe I'm wrong?

Entering edit mode

Please don't delete posts after they have received a comment/answer.

Did my suggestion work? Or were you able to find parameters that work.

Entering edit mode
3.4 years ago

The option samplereadstarget seems to be exactly what you want.

To subsample 1M reads from a file, I ran: \
    in=input.fq \
    out=output.fq \
    samplereadstarget=1000000 \

To check:

wc -l output.fq

4000000 -- as expected for a fastq file.

Additionally, to check that the reads were random and not just the top 1M reads from the file, I checked the first few read names and they seem different:

grep -m 5 "^@" input.fq

@NB501138:242:H7WV2BGXB:1:11101:20526:1063 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:4613:1064 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:10306:1066 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:1896:1067 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:16712:1068 1:N:0:AAGTTGAT

grep -m 5 "^@" output.fq

@NB501138:242:H7WV2BGXB:1:11101:25676:1087 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:9954:1149 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:22627:1297 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:13619:2023 1:N:0:AAGTTGAT @NB501138:242:H7WV2BGXB:1:11101:1974:2172 1:N:0:AAGTTGAT


Login before adding your answer.

Traffic: 1805 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6