Question

seqtk subsample fastq file question

15 months ago

shinyjj ▴ 50

Hi biostars,

I have a long-read FASTQ file containing 57,523,865 reads. I am trying to subsample it with seqtk, but the output contains zero reads. Can someone help with this issue?

wc -l ALL1807_RW0588_051220_LiveGuppy.fastq
230095460 ALL1807_RW0588_051220_LiveGuppy.fastq
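The read count quoted above follows from this line count, since a FASTQ record normally spans four lines (230,095,460 / 4 = 57,523,865). A minimal sketch of that check, assuming an uncompressed, well-formed FASTQ with exactly four lines per record; the helper name `count_reads` and the demo file are made up for illustration:

```shell
# Hypothetical helper: count reads in an uncompressed 4-line-per-record FASTQ.
count_reads() {
    # Reading from stdin makes wc print only the number, not the file name
    echo $(( $(wc -l < "$1") / 4 ))
}

# Tiny demo file with two made-up reads so the snippet runs on its own
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTA\n+\nIIIII\n' > demo.fastq
count_reads demo.fastq   # prints 2
```

Running the same helper on the subsampled output is a quick way to confirm whether it is really empty.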

Here is the seqtk command line I used.

./seqtk sample ~/ALL1807_RW0588_051220_LiveGuppy.fastq 19473944 > ALL1807_RW0588_051220_LiveGuppy_sub.fastq

When I compute the mean read length, I get a non-integer value. Does that make sense? I have never seen a mean read length with a decimal point. Could this be why seqtk sample is not working?

awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}' ALL1807_RW0588_051220_LiveGuppy.fastq
869.649
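A fractional mean is expected in general: the mean is total bases divided by number of reads, and that quotient is rarely a whole number, so the decimal by itself does not point to a malformed file. A minimal sketch with two made-up reads of lengths 4 and 5; the helper name `mean_read_length` and the demo file are assumptions for illustration:

```shell
# Hypothetical helper: mean read length of an uncompressed 4-line-per-record FASTQ.
mean_read_length() {
    # The sequence is line 2 of every 4-line record
    awk 'NR % 4 == 2 { count++; bases += length($0) }
         END { if (count) print bases / count }' "$1"
}

# Two made-up reads: 4 bases and 5 bases, so 9 bases / 2 reads
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTA\n+\nIIIII\n' > demo_mean.fastq
mean_read_length demo_mean.fastq   # prints 4.5
```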

seqtk subsample • 777 views

Solved! Thanks

15 months ago by shinyjj ▴ 50

score 0 · Answer 1 · 2023-01-18

15 months ago

michael ▴ 10

You can also check out rasusa (https://github.com/mbhall88/rasusa) for subsampling reads. It was originally designed with long reads in mind, and it can also subsample to a target number of bases or a target coverage, if that is closer to what you want than a fixed number of reads.
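To make the "number of bases or coverage" idea concrete, here is a rough sketch of the arithmetic, assuming coverage is defined as total bases divided by genome size. This is not rasusa's implementation; the helper names `target_bases` and `keep_fraction`, and the genome size and coverage in the example, are made up for illustration:

```shell
# Hypothetical helper: bases needed for a given coverage of a genome.
target_bases() {
    # $1 = desired coverage (x), $2 = genome size in bases
    awk -v cov="$1" -v gs="$2" 'BEGIN { printf "%.0f\n", cov * gs }'
}

# Hypothetical helper: fraction of the input to keep to reach that many bases.
keep_fraction() {
    # $1 = target bases, $2 = total bases in the input file
    awk -v t="$1" -v total="$2" \
        'BEGIN { f = t / total; if (f > 1) f = 1; printf "%.4f\n", f }'
}

# Made-up example: 10x of a 1 Mb genome from a file holding 40 Mb of reads
tb=$(target_bases 10 1000000)   # 10000000
keep_fraction "$tb" 40000000    # prints 0.2500
```

The total bases in a file can be estimated as read count times mean read length. A fraction computed this way only approximates a base target when reads are sampled at random, which is one reason a tool that tracks bases directly may be more convenient.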
