Question

extract subset of sequence

6.7 years ago

Björn ▴ 110

Hi, how can I extract the first 1 million lines, or let's say the first 250,000 reads (small RNA), from an xyz.fastq.gz file and export them as a new file?

RNA-Seq fastq • 4.5k views

ADD COMMENT • link updated 6.7 years ago by Matt Shirley 10k • written 6.7 years ago by Björn ▴ 110

How to randomly extract reads from a FASTQ file?

or

zcat file.fastq.gz | head -n $((4 * 250000)) > new.fastq   # 4 lines per read; replace 250000 with the number of reads you want
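
The line arithmetic behind this one-liner can be sketched as a small self-contained script. It assumes a standard FASTQ with exactly 4 lines per record (no wrapped sequence lines); the file names and the variable names n_reads and n_lines are hypothetical, and gzip -dc is used instead of zcat because zcat behaves differently on some systems.

```shell
# Hypothetical names throughout: demo.fastq.gz, first_reads.fastq, n_reads, n_lines.
tmpdir=$(mktemp -d)

# Build a tiny 3-read gzipped FASTQ so the sketch runs on its own.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' \
  | gzip > "$tmpdir/demo.fastq.gz"

n_reads=2                   # number of reads to keep
n_lines=$((4 * n_reads))    # assumes 4 lines per FASTQ record

# Decompress to stdout and keep only the first n_lines lines.
gzip -dc "$tmpdir/demo.fastq.gz" | head -n "$n_lines" > "$tmpdir/first_reads.fastq"

# Report how many lines were written.
wc -l < "$tmpdir/first_reads.fastq"
```

For the question as asked, n_reads=250000 gives n_lines=1000000, which is why "first 1 million lines" and "first 250,000 reads" describe the same subset.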

ADD REPLY • link 6.7 years ago by noeD ▴ 130

score 2 · Answer 1 · 2017-07-31

6.7 years ago

cpad0112 21k

To extract the first 250000 reads from xyz.fastq.gz (assuming the file contains at least 250000 reads):

seqkit head -n 250000 xyz.fastq.gz > output.fq

Download seqkit from here. To count records in the output:

seqkit seq -n output.fq | wc -l

Output should be 250000.

ADD COMMENT • link 6.7 years ago by cpad0112 21k

Just use seqkit stats xx.fq.gz for counting.

It also supports writing gzipped output with -o out.fq.gz

ADD REPLY • link 6.7 years ago by shenwei356 8.4k

thanks @shenwei356

ADD REPLY • link 6.7 years ago by cpad0112 21k

score 1 · Answer 2 · 2017-07-31

Use reformat.sh from the BBMap suite, adding one or more of the parameters listed below:

reformat.sh in=original.fq.gz out=sampled.fq.gz additional_parameter_below

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
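
As an illustration of how these parameters combine, here is a hedged sketch that builds two commands: one taking the first 250,000 reads with reads=, and one drawing 250,000 random reads with samplereadstarget= and a fixed sampleseed=. The file names and the seed value 42 are assumptions for illustration only. The script runs the commands only if reformat.sh is on the PATH and the input exists; otherwise it just prints them.

```shell
# Hypothetical file names and seed; only parameters from the list above are used.
in=xyz.fastq.gz
first_cmd="reformat.sh in=$in out=first250k.fastq.gz reads=250000"
random_cmd="reformat.sh in=$in out=random250k.fastq.gz samplereadstarget=250000 sampleseed=42"

for cmd in "$first_cmd" "$random_cmd"; do
  if command -v reformat.sh >/dev/null 2>&1 && [ -f "$in" ]; then
    $cmd                      # run for real when BBMap and the input file are present
  else
    echo "dry run: $cmd"      # otherwise only show what would be executed
  fi
done
```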

score 0 · Answer 3 · 2017-07-31

6.7 years ago

Pierre Lindenbaum 161k

Hi, how can I extract the first 1 million lines, or let's say the first 250,000 reads (small RNA), from an xyz.fastq.gz file and export them as a new file?

gunzip -c in.fq.gz | head -n 1000000 | gzip > out.fq.gz
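
One way to check the result of such a pipeline is to count records in the output and confirm that the line count is a multiple of 4. The sketch below does this on a tiny generated file; it assumes 4-line FASTQ records, and the file names and the summary variable are hypothetical.

```shell
# Hypothetical names: in.fq.gz, out.fq.gz, summary.
tmpdir=$(mktemp -d)

# Generate a tiny 5-read gzipped FASTQ so the sketch is self-contained.
i=1
while [ "$i" -le 5 ]; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
  i=$((i + 1))
done | gzip > "$tmpdir/in.fq.gz"

# Same pattern as above: keep the first 12 lines (3 reads) and recompress.
gunzip -c "$tmpdir/in.fq.gz" | head -n 12 | gzip > "$tmpdir/out.fq.gz"

# Count records and check that header and separator lines sit where a
# 4-line record expects them ('@' on line 1, '+' on line 3 of each record).
summary=$(gunzip -c "$tmpdir/out.fq.gz" | awk '
  NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }
  NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }
  END { printf "reads=%d leftover_lines=%d bad_lines=%d", int(NR / 4), NR % 4, bad }')
echo "$summary"
```

A non-zero leftover_lines or bad_lines value would suggest the cut landed mid-record or that the input is not a plain 4-line FASTQ.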

ADD COMMENT • link 6.7 years ago by Pierre Lindenbaum 161k

pigz is recommended for faster compression.

ADD REPLY • link 6.7 years ago by shenwei356 8.4k

Elegant :) One has to know that there are exactly 4 lines for each read.

ADD REPLY • link 6.7 years ago by grant.hovhannisyan ★ 2.6k

extract first 1 million lines or let's say first 250,000 reads (small RNA)

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 161k