Discrepancy in read counts from fastq
15 days ago
marco.barr ▴ 120

Hello everyone, I'd like to understand why two downsampling methods give me slightly different read counts when I extract a fixed percentage (20%) of the reads. In the first, I used seqtk; in the second, I took a more manual approach. Why are the counts slightly different? This matters to me because I'm working with patient samples. My commands are below. Thank you all for your help.

# Method 1: seqtk with sampling fraction 0.2 and seed 100
seqtk sample -s100 file.fastq 0.2 > downsample_file.fastq
echo $(( $(wc -l < downsample_file.fastq) / 4 ))
146513

# Method 2: take the first 20% of the reads with head
num_tot=$(wc -l file.fastq | awk '{print $1/4}')
reads_extr=$(echo "$num_tot * 0.2" | bc | cut -d. -f1)
head -n $((reads_extr * 4)) file.fastq > downsample1_file.fastq
echo $(( $(wc -l < downsample1_file.fastq) / 4 ))
145946
downsampling fastq • 166 views

Try reformat.sh from the BBMap suite and its sampling parameters. You may well get a third answer. Sampling programs make some approximations, so it is not surprising that different tools give different counts. Using the same seed with the same program should, however, give you the same result every time.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
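
To illustrate why a fraction-based sampler and a fixed-count approach can disagree, here is a minimal sketch on a synthetic FASTQ. It assumes that a fraction such as 0.2 is applied by keeping each read independently with that probability, which is my understanding of how seqtk treats a fraction below 1 but is not confirmed in this thread. The file name `toy.fastq`, the read count of 1000, and the awk-based sampler are all made up for the illustration; this is not seqtk's or reformat.sh's actual code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a synthetic FASTQ with 1000 reads (4 lines per read). Hypothetical data.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "@read%d\nACGT\n+\nIIII\n", i }' > "$tmp/toy.fastq"

total=$(( $(wc -l < "$tmp/toy.fastq") / 4 ))

# Method A (assumed behaviour of a fraction-based sampler): decide for each
# read independently, with probability p, whether to keep it. The number of
# reads kept is random and only approximately p * total.
bernoulli=$(awk -v seed=100 -v p=0.2 '
    BEGIN { srand(seed) }
    NR % 4 == 1 { keep = (rand() < p) }   # one decision per read, on its header line
    keep && NR % 4 == 1 { n++ }           # count the reads that were kept
    END { print n + 0 }' "$tmp/toy.fastq")

# Method B (the manual approach from the question): take exactly the first
# 20% of reads. The count is exact, but the reads are not a random subset.
fixed=$(( total / 5 ))
head_count=$(( $(head -n $(( fixed * 4 )) "$tmp/toy.fastq" | wc -l) / 4 ))

echo "total=$total bernoulli=$bernoulli head=$head_count"
rm -rf "$tmp"
```

Under that assumption, the head-based count is always exactly 20% (rounded down), while the sampled count scatters around 20% with a standard deviation of about sqrt(N * 0.2 * 0.8). For a file of roughly 730,000 reads, as the counts in the question imply, that is on the order of a few hundred reads, so a gap of 567 is within what chance alone would produce. Note also that head takes the first reads in the file rather than a random subset. If an exact number of randomly chosen reads is needed, seqtk sample accepts an integer read count in place of a fraction, and reformat.sh has the samplereadstarget parameter listed above.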