Downsampling fastq file
11 days ago
marco.barr ▴ 90

Hi everyone, I'm attempting to downsample a FASTQ file to retain only 20% of the reads. I'm using seqtk with the command: seqtk sample -s 11000 file.fastq 0.2 > downsample_file.fastq. However, it seems to be doing the opposite and filtering out 20% instead. Did I make an error in the command? Should I use 0.8 instead? Thank you for your assistance.

downsample fastq • 448 views

It should not do that. Wild suggestion, but try using something like 100 for the seed value - maybe the large seed value is causing some sort of unexpected bug. I know it doesn't make sense but give it a shot.


I followed your advice and it seems that I'm getting results comparable to what I was getting before. Upon checking with wc -l on the original R1.fastq file, I have 298949 lines, while in the downsampled file I even have more lines, 584432. How is this possible? Should I use reformat.sh since it's a paired-end file? Thanks for the advice
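
The line counts quoted above can be turned into read counts with a quick check. This is only a minimal sketch: `fastq_count` is a hypothetical helper name, and it assumes an uncompressed FASTQ file with the usual four lines per record (no wrapped sequences). Note that 298949 is not a multiple of four, which may point to an extra blank line or a truncated record in the original file.

```shell
# Hypothetical helper: report line and read counts for an uncompressed FASTQ file.
# Assumes the standard four-lines-per-record layout (no wrapped sequences).
fastq_count() {
    awk 'END {
        printf "%d lines, %d complete reads", NR, int(NR / 4)
        # A remainder means the file is not a clean sequence of 4-line records.
        if (NR % 4 != 0) printf ", %d leftover lines (truncated or malformed?)", NR % 4
        printf "\n"
    }' "$1"
}
```

Running `fastq_count R1.fastq` and `fastq_count downsample_file.fastq` side by side should make it obvious whether the output really contains about 20% of the input reads.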


The file being PE should not matter. I'd recommend you open an issue on the seqtk github repo as this is starting to look like some sort of niche bug.


I understood where the problem lies. I discussed with my wet lab colleagues (this is part of a bioinformatician's job...) and by showing them the results, we realized that they had made errors in DNA extraction, which affected everything else despite the fastq appearing 'clean'. Unfortunately, the saying 'garbage in, garbage out' always holds true... Thank you, Ram, for your advice.


Maybe try using seqkit instead?


I think the problem might be that you specified both a proportion and a fixed number in the same command. Instead, please try this:

# 1. Downsample to a fraction of the reads
seqtk sample file1.fastq 0.2 > file1_sub1.fastq
seqtk sample file2.fastq 0.2 > file2_sub1.fastq

Or, if you want a fixed number of reads:

# 2. Downsample to a fixed number of reads
seqtk sample file1.fastq 20000 > file1_sub1.fastq
seqtk sample file2.fastq 20000 > file2_sub1.fastq

OP is not specifying both. The 11000 is the seed value, not the number of reads.
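
To illustrate the difference between the seed and the fraction, here is a rough awk sketch of seeded fraction sampling. It is an illustration of the idea only, not seqtk's actual implementation; `sample_fastq` and its argument order (file, seed, fraction) are made up for this example, and it assumes four-line FASTQ records. Because the keep/drop decision depends only on the seed and the record position, using the same seed on R1 and R2 keeps the mates paired.

```shell
# Illustrative only: keep each 4-line FASTQ record with probability $3,
# using $2 as the random seed. Mimics the idea behind
# `seqtk sample -s SEED file FRACTION`; it is NOT seqtk's implementation.
sample_fastq() {
    awk -v seed="$2" -v frac="$3" '
        BEGIN { srand(seed) }                    # same seed => same sequence of draws
        NR % 4 == 1 { keep = (rand() < frac) }   # decide once per record, on the header line
        keep                                     # print all four lines of a kept record
    ' "$1"
}
```

For example, `sample_fastq R1.fastq 100 0.2 > R1_sub.fastq` followed by `sample_fastq R2.fastq 100 0.2 > R2_sub.fastq` would select the same record positions from both files, provided they contain the same number of reads in the same order.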


I'm moving this to a comment for now.
