17 days ago
marco.barr
Hello everyone, I'm trying to understand why two downsampling methods give me slightly different read counts when I extract the same fraction (20%) of reads. In the first, I used seqtk; in the second, I took a more manual approach. Why do the two methods return slightly different numbers of reads? This matters to me because I'm working with patient data. My commands and the resulting read counts are below. Thank you all for your help.
seqtk sample -s100 file.fastq 0.2 > downsample_file.fastq
echo $(cat downsample_file.fastq | wc -l)/4 | bc
146513
num_tot=$(wc -l file.fastq | awk '{print $1/4}')
reads_extr=$(echo "$num_tot * 0.2" | bc | cut -d. -f1)
head -n $((reads_extr * 4)) file.fastq > downsample1_file.fastq
echo $(cat downsample1_file.fastq | wc -l)/4 | bc
145946
Try reformat.sh from the BBMap suite and its sampling parameters. You may get a third answer. Programs make some approximations when sampling, so it is not surprising that you get different answers. But using the same seed with a given program should give you the same results.
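To illustrate where the difference can come from, here is a minimal sketch on a toy file. It assumes that fraction-based sampling keeps each read independently with the given probability, which is my understanding of how seqtk treats a value below 1 and would explain a count that is only close to 20%. The awk sampler, the helper names (sample_fraction, count_reads) and the toy file names are made up for the illustration; this is not seqtk's code. Note also that head does not sample at all: it takes the first 20% of reads in file order.

```shell
#!/usr/bin/env bash
# Build a toy FASTQ with 1000 reads (4 lines per read). File name is hypothetical.
for i in $(seq 1 1000); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > toy.fastq

# Keep each read independently with probability FRACTION.
# The number of reads kept is near FRACTION * total, not exactly equal to it.
sample_fraction() {  # usage: sample_fraction SEED FRACTION FILE
  awk -v seed="$1" -v frac="$2" '
    BEGIN { srand(seed) }
    NR % 4 == 1 { keep = (rand() < frac) }  # decide once per read, on the header line
    keep                                    # print all 4 lines of a kept read
  ' "$3"
}

# Number of reads in a FASTQ file (line count divided by 4).
count_reads() { echo $(( $(wc -l < "$1") / 4 )); }

# Fraction-based sampling, run twice with the same seed.
sample_fraction 100 0.2 toy.fastq > frac_seed100.fastq
sample_fraction 100 0.2 toy.fastq > frac_seed100_again.fastq

# "Manual" approach: the first 20% of reads, exactly 200, in file order.
head -n $(( 200 * 4 )) toy.fastq > first200.fastq

echo "fraction-based: $(count_reads frac_seed100.fastq) reads"
echo "head-based:     $(count_reads first200.fastq) reads"
cmp -s frac_seed100.fastq frac_seed100_again.fastq && echo "same seed, same reads"
```

If you need an exact number of reads that is still a random subset, seqtk sample also accepts an integer read count in place of the fraction (for example the 145946 computed above), so both approaches can be made to return the same count.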