Question

extract reads with given size range

0

Entering edit mode

7.9 years ago

apelin20 ▴ 480

Hello,

I did a smallRNA-Seq experiment where the sequencing provider was supposed to sequence small RNAs 15-25 nt. After trimming adapters, I see that there are reads ranging in size from 15 to 35 bp, but also reads that are 50 bp (the full length of the read). Since I sent total RNA, I assume the 50bp are RNA species other than miRNA/smallRNA. I was to extract all reads <40bp and >49bp in 2 separate files.

The only way I can figure to do this is to convert to fasta, determine the length of each read using a tool, use R to create my 2 lists of reads based on sizes, save the list and use seqtk to extract from fastq. This sounds long and silly.

Any alternatives?

Thanks

RNA-Seq smallRNA-Seq fastq • 4.7k views

ADD COMMENT • link updated 7.9 years ago by iraun 6.2k • written 7.9 years ago by apelin20 ▴ 480

1

Entering edit mode

Instead you can directly extract reads which are of particular length.

and most importantly, a previous dicussion: Filtering Fastq Sequences Based On Lengths

ADD REPLY • link 7.9 years ago by venu 7.1k

1

Entering edit mode

You can also use the BBMap package like this:

reformat.sh in=reads.fq out=filtered.fq minlength=15 maxlength=25

But, you can just as easily do that at the same time as adapter trimming if you use BBDuk, which also supports those flags. Both Reformat and BBDuk are many times faster than prinseq or fastx.

ADD REPLY • link updated 7.9 years ago by GenoMax 141k • written 7.9 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2016-06-13

Have you tried awk solution to parse fastq files?

awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 40) {print header, seq, qheader, qseq}}' input.fq > lessthan40bp.fq

awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) > 49) {print header, seq, qheader, qseq}}' input.fq > morethan49bp.fq