Question: Quickest way to extract subset of reads from huge fastq file
Prakki Rama (2.2k) wrote:

Hi all,

Is there a quick way to extract a subset of reads from a huge FASTQ file into another file? I have already tried the following:

    grep -A3 '1:N:0:' ORGAN1.fastq >ORGAN1.cleaned.fastq

but grep takes too long. Any one-liners would be much appreciated.

Thank you

Prakki Rama.

 

rna-seq unix next-gen fastq • 4.7k views
modified 4.5 years ago by da44da (0) • written 4.5 years ago by Prakki Rama (2.2k)
Philipp Bayer (5.7k) wrote:

A faster way is to do this:

    LC_ALL=C fgrep -A3 '1:N:0:' ORGAN1.fastq >ORGAN1.cleaned.fastq

Here's an explanation of why this is so much faster.
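
A FASTQ-aware variant of the same idea, as a minimal sketch rather than anything proposed in this thread. It swaps grep for awk run under the C locale, assumes strictly four lines per record (no wrapped sequences), and tests only the header line of each record. That way a quality string cannot match by accident, and the output does not contain the `--` separator lines that GNU grep prints between non-adjacent `-A3` groups. The function name `extract_reads` is made up for illustration, and no claim is made here about how its speed compares with fgrep.

```shell
# Hypothetical helper: print every 4-line FASTQ record whose header
# line contains the fixed string $1. Assumes exactly four lines per record.
extract_reads() {
    pattern=$1
    input=$2
    LC_ALL=C awk -v pat="$pattern" '
        NR % 4 == 1 { keep = (index($0, pat) > 0) }  # decide on the header line
        keep                                          # print all lines of kept records
    ' "$input"
}

# Usage with the file names from the question:
# extract_reads '1:N:0:' ORGAN1.fastq > ORGAN1.cleaned.fastq
```

Because `index()` does a plain substring search, the pattern is treated as a fixed string, just as with fgrep. Note that `awk -v` interprets backslash escapes, so a pattern containing backslashes would need extra care.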

modified 4.5 years ago • written 4.5 years ago by Philipp Bayer (5.7k)

Wow! Plain grep on a sample file took 17 sec, whereas the LC_ALL=C fgrep version took only 4 sec. Wonderful! Thank you very much.

written 4.5 years ago by Prakki Rama (2.2k)

For finding a fixed string, LC_ALL=C fgrep is very fast. But when it comes to matching a regex, it is slower (although still slightly faster than normal grep).

written 4.5 years ago by Prakki Rama (2.2k)

fgrep doesn't work with regexes (that's why it's faster). Could it be that it switches to egrep or grep -g for you?
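
The difference is easy to see on a tiny input. In this sketch the two sample lines are made up, and `grep -F` is used as the option form of fgrep: the fixed-string search treats the dot literally, while plain grep treats it as "any character".

```shell
# Two made-up lines: one contains a literal dot, one does not.
sample=$(printf 'read.1\nreadX1\n')

# Fixed-string search: only the line with a literal "." matches.
fixed_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -F 'read.1')

# Regex search: "." matches any single character, so both lines match.
regex_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep 'read.1')

printf 'fixed:\n%s\n' "$fixed_hits"
printf 'regex:\n%s\n' "$regex_hits"
```

So a pattern that relies on regex metacharacters will silently match something different (often nothing) under fgrep, which can make a run look fast for the wrong reason.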

written 4.5 years ago by Philipp Bayer (5.7k)

Yes, you are right. It does not work for regex. I was only looking at the execution time. My mistake.

written 4.4 years ago by Prakki Rama (2.2k)

Alex Reynolds (26k) wrote:

If you are repeatedly querying this file, try splitting it into smaller units (say, with UNIX split), and then search through the smaller files in parallel. You could do this with, say, jobs scheduled on an SGE grid, or with GNU Parallel.
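
A rough sketch of the split-and-search idea, using plain background jobs instead of SGE or GNU Parallel so that it needs nothing beyond a POSIX shell, split, and awk. It assumes strictly four lines per read, and the function name `parallel_extract` and its arguments are hypothetical. It also swaps grep for a small awk filter that checks only the header line of each record.

```shell
# Hypothetical helper: split a FASTQ file into chunks and search them concurrently.
# $1 = fixed string, $2 = input FASTQ, $3 = output file, $4 = reads per chunk.
parallel_extract() {
    pattern=$1; input=$2; output=$3; reads_per_chunk=$4
    workdir=$(mktemp -d) || return 1

    # Four lines per read, so chunk boundaries never cut a record in half.
    split -l "$((reads_per_chunk * 4))" "$input" "$workdir/chunk."

    # One background awk per chunk; each writes its own result file.
    for chunk in "$workdir"/chunk.*; do
        LC_ALL=C awk -v pat="$pattern" '
            NR % 4 == 1 { keep = (index($0, pat) > 0) }  # test the header line only
            keep
        ' "$chunk" > "$chunk.hits" &
    done
    wait  # block until every background search has finished

    # Glob expansion is sorted, so the original read order is preserved.
    cat "$workdir"/chunk.*.hits > "$output"
    rm -rf "$workdir"
}

# Usage with the file names from the question (chunk size is an arbitrary example):
# parallel_extract '1:N:0:' ORGAN1.fastq ORGAN1.cleaned.fastq 1000000
```

The sketch starts one job per chunk, so the chunk size should be chosen so that the number of chunks roughly matches the available cores; a scheduler or GNU Parallel would cap the concurrency for you. With split's default two-letter suffixes, very small chunks of a huge file can also exhaust the available suffixes. For a single pass over the file, the cost of splitting may cancel out much of the gain.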

modified 4.5 years ago • written 4.5 years ago by Alex Reynolds (26k)

Thank you, Alex. But I might not need to query the file repeatedly. grep is taking a long time, and sed seems to be more or less the same.

modified 4.5 years ago • written 4.5 years ago by Prakki Rama (2.2k)

da44da (0) wrote:

Thanks for sharing this information. I was looking for it on the internet.

written 4.5 years ago by da44da (0)

Try to avoid adding an answer if you're not answering the question. You can always use the "Add Comment" button below the question or below another answer if you want to.

written 4.5 years ago by Alex Reynolds (26k)