5 months ago
Jjbox ▴ 40

Hi all,

FASTQ files contain sequencing reads 'as they come off the sequencing instrument.' Are the reads in a long-read FASTQ file from ONT or PacBio in any particular order, e.g. by position on the flow cell, or by quality?

I am trying to extract a certain number of reads from both ONT and PacBio data using seqtk sample, something like the command below.

./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq


I want to make sure that the output of the above command, pcb_sub.fastq, contains 10,000 reads drawn at random from all the reads in the pcb.fastq file.
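One quick sanity check on the read count is to count the records in the output file. The sketch below assumes the common 4-lines-per-record FASTQ layout (it will miscount if sequences are wrapped over several lines); `count_reads` and `demo.fastq` are hypothetical names used only for illustration, and the demo reads are made up so that the snippet runs without seqtk.

```shell
# Hypothetical helper: count records in a FASTQ file,
# assuming exactly 4 lines per record (header, sequence, '+', qualities).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up FASTQ file with three reads, just to exercise the helper.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > demo.fastq

count_reads demo.fastq
```

Running `count_reads pcb_sub.fastq` on the real output should then report 10000, provided pcb.fastq holds at least that many reads. This only checks the count, not how the reads were chosen.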

fastq seqtk rna-seq • 570 views
5 months ago
GenoMax 127k

You could sub-sample several times to generate a number of files, cat them together, and then sample again from that pool to ensure that you get a random mix.

You could also try reformat.sh from the BBMap suite, which gives you control over how you sample:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
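As a rough sketch of how these options might be combined on the command line: the file names are taken from the question, while `samplerate=0.1` and `sampleseed=100` are illustrative values, not recommendations. Note that `samplerate` keeps a fraction of the reads, so the output size is approximate rather than an exact read count. The guard simply skips the call when reformat.sh or the input file is not available.

```shell
# Assumed file names (from the question); adjust to your own data.
in_fq=pcb.fastq
out_fq=pcb_sub.fastq

if command -v reformat.sh >/dev/null 2>&1 && [ -f "$in_fq" ]; then
    # samplerate=0.1 keeps roughly 10% of the reads (illustrative value);
    # sampleseed fixes the PRNG seed so the same subset is drawn on every run.
    reformat.sh in="$in_fq" out="$out_fq" samplerate=0.1 sampleseed=100
else
    echo "reformat.sh or $in_fq not found; nothing sampled"
fi
```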


Hello, thanks for helping. Can you clarify what you mean by "Cat them together and then sample again from that pool to ensure that you get a random mix"?

I thought that extracting pcb_sub.fastq with the above command gives a random mix of 10,000 reads.

Thanks!


reformat.sh will give you a random mix if you use the sampling parameters. If you were worried about there being some pattern in either seqtk or the above command, then you could do multiple sampling rounds to get a new set to sample from, provided you had a gigantic dataset to begin with. I was perhaps being too conservative.


Oh I see, so for seqtk, I can run the above command with a different seed in each round, such as ./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq in the first round and ./seqtk sample -s101 pcb_sub.fastq 10000 > pcb_sub2.fastq in the second round (writing to a new file, since redirecting into the file being read would empty it), right?


If you want to be super cautious, then yes.