Discrepancy in read counts from fastq

17 days ago

marco.barr ▴ 130

Hello everyone, I'm looking for advice on why two downsampling methods give me slightly different read counts when I try to keep 20% of the reads. The first uses seqtk; the second is a more manual approach based on head. Why is the number of reads slightly different? This matters to me because I'm working with patient samples. My commands and the resulting counts are below. Thank you all for your help.

seqtk sample -s100 file.fastq 0.2 > downsample_file.fastq
echo $(cat downsample_file.fastq | wc -l)/4 | bc 
146513

num_tot=$(wc -l file.fastq | awk '{print $1/4}')
reads_extr=$(echo "$num_tot * 0.2" | bc | cut -d. -f1)
head -n $((reads_extr * 4)) file.fastq > downsample1_file.fastq
echo $(cat downsample1_file.fastq | wc -l)/4 | bc
145946

downsampling fastq • 171 views

updated 16 days ago by Ram 43k • written 17 days ago by marco.barr ▴ 130

Try reformat.sh from the BBMap suite and its sampling parameters. You may get a third answer. Sampling by fraction is generally approximate, because reads are kept or dropped at random rather than counted out exactly, so it is not surprising that the two counts differ. Using the same seed with a given program should give you the same result every time.
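To make the "approximation" concrete, here is a minimal sketch, assuming that a fractional sampler keeps each read independently with probability 0.2 (as far as I know this is what seqtk sample does when given a fraction, but treat that as an assumption). The total of 729730 reads is not stated in the question; it is inferred from the head-based count (145946 is 20% of roughly that number). The function name bernoulli_count is made up for illustration, and awk's random number generator is not the one seqtk uses, so the simulated counts will not reproduce 146513 exactly.

```shell
# Simulate "keep each read with probability p" for n reads and a given seed.
# This models per-read random sampling only; it does not read any FASTQ file.
bernoulli_count() { # $1 = number of reads, $2 = probability, $3 = seed
  awk -v n="$1" -v p="$2" -v seed="$3" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) if (rand() < p) kept++
    print kept + 0
  }'
}

# Different seeds give slightly different counts around 20% of the total.
# 729730 is an assumed total, inferred from the head-based count in the question.
for seed in 100 101 102; do
  echo "seed=$seed kept=$(bernoulli_count 729730 0.2 "$seed")"
done

# Expected count and standard deviation of a binomial(n, p) draw.
awk -v n=729730 -v p=0.2 'BEGIN {
  printf "expected=%d sd=%.0f\n", n * p, sqrt(n * p * (1 - p))
}'
```

Under these assumptions the standard deviation is a few hundred reads, so a gap of 567 reads between 146513 and 145946 is within the normal spread of random sampling. Note also that head is not a random sample at all: it returns the first 20% of reads in file order.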

Sampling parameters of reformat.sh:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
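If an exact read count is what you need, the parameters above could be combined roughly as follows. This is a sketch, not a tested recipe: target_reads is a hypothetical helper, the file names are placeholders, the in= and out= arguments follow the usual BBTools key=value style, and the FASTQ is assumed to have exactly four lines per read (no wrapped sequences).

```shell
# Number of reads that make up a given fraction of a FASTQ file (4 lines per read),
# truncated to a whole read like the bc/cut step in the question.
target_reads() { # $1 = FASTQ path, $2 = fraction
  awk -v f="$2" 'END { printf "%d\n", int(NR / 4 * f) }' "$1"
}

# Only run when the tool and the input file are present; file names are placeholders.
if command -v reformat.sh >/dev/null 2>&1 && [ -f file.fastq ]; then
  n=$(target_reads file.fastq 0.2)
  # samplereadstarget asks for an exact number of output reads;
  # sampleseed makes the random selection repeatable.
  reformat.sh in=file.fastq out=downsample2_file.fastq samplereadstarget="$n" sampleseed=100
fi
```

With samplereadstarget the output count should match the head-based count, while the reads themselves are drawn at random from the whole file rather than from its beginning.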

17 days ago by GenoMax 142k